Hadoop Integration with OBIEE 11g
The newest release of Oracle Business Intelligence, 11.1.1.7, shows Oracle's continued effort to integrate its Business Intelligence platform with big data technologies such as Hadoop and Hive. Specifically, I'm talking about OBIEE 11g's ability to integrate with a Hadoop data source.
What is Hadoop?
Hadoop is a framework that distributes data across many servers (nodes) in a cluster, storing it in what is commonly referred to as a 'distributed file system'. The data is not stored in a single database; rather, it is spread across the nodes of the cluster.
How does Hadoop process data stored in multiple nodes?
Hadoop uses a programming model called 'MapReduce' for parallel processing across multiple nodes. At a high level, it consists of two steps:
- Map step: the input data is divided into smaller chunks, which are distributed to worker nodes for processing
- Reduce step: the results from all of the worker nodes are collected and aggregated into a single output
What is Hive?
MapReduce functions are generally written in Java and typically require someone with deep knowledge of both Hadoop and MapReduce. The folks over at Facebook created a technology called 'Hive', a data warehouse infrastructure that sits on top of Hadoop. Put simply, Hive does the 'heavy lifting' of creating the MapReduce functions. To query a Hadoop distributed file system, instead of writing MapReduce code yourself, you write SQL-style statements in Hive's query language, 'HQL' (HiveQL).
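For example, here's a minimal sketch of an HQL query (the table and column names are hypothetical) that Hive translates into map and reduce jobs behind the scenes:

-- Hypothetical web log table stored in the Hadoop file system.
-- Hive compiles this SQL-style query into MapReduce jobs automatically:
-- the map step scans and groups the rows, the reduce step aggregates the counts.
SELECT page_url,
       COUNT(*) AS page_views
FROM web_logs
GROUP BY page_url;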
Why does this matter in the Oracle Business Intelligence / Analytics space?
The analytics space is experiencing a shift in both technology and function. Traditional BI projects required a data warehouse that stores data in a series of star schemas (denormalized models) for fast query generation and data retrieval. The development and support of that data warehouse falls to a team of ETL developers whose main focus is to create the mappings that transform data from the source to the target.
Unless the functional requirements are clearly understood during this phase, value is usually lost in the data transformation, and relevant data may be dropped along the way.
With OBIEE 11g's Hadoop integration via a Hive ODBC driver, OBIEE can query a distributed file system directly through Hive. What does this mean? The potential now exists to eliminate or reduce the need for ETL, since we can now query enormous file systems directly.
The saving grace for ETL developers is that someone still needs to create the HQL that populates the 'tables' that OBIEE uses. Ultimately, it could be a change in how ETL is developed.
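As a sketch of what that could look like (the table and column names here are hypothetical), an ETL-style HQL statement might build a summary table in Hive that OBIEE then imports and queries:

-- Hypothetical example: create a daily sales summary table in Hive,
-- replacing what would traditionally be an ETL mapping into a warehouse.
CREATE TABLE sales_daily_summary AS
SELECT order_date,
       product_id,
       SUM(order_amount) AS total_sales
FROM sales_raw
GROUP BY order_date, product_id;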
How do you integrate OBIEE 11g with Hadoop?
Step 1: Download the Hive ODBC Drivers from http://support.oracle.com
You can reference Oracle Note 'Using Oracle Hadoop ODBC Driver with BI Administration Tool [ID 1520733.1]'
Step 2: Create a Hive ODBC Connection via the ODBC Data Source Administrator
Similar to how you create an ODBC connection to edit the repository online, you're going to create an ODBC connection, but this time specify the driver as 'Oracle Apache Hadoop Hive WP Driver'.
Once you've created the ODBC data source connection, you can configure the driver setup under the 'General' tab.
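While the exact fields depend on the driver version, a Hive DSN typically needs at least a data source name, the Hive server host, the HiveServer port (10000 by default), and the Hive database. A rough sketch (the host name here is hypothetical):

Data Source Name:  Hive_DSN       (any name; you'll reference it from the connection pool)
Host:              hadoop-node01  (hypothetical Hive server host)
Port:              10000          (default HiveServer port)
Database:          default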
Step 3: Configure Database Connection
Moving into the repository, you're going to create a new database connection like you would for any data source in the physical layer. Note that you need to specify the database type as 'Apache Hadoop' (this is important!).
Step 4: Create Connection Pool
Within the Apache Hadoop database connection you just created in Step 3, create a connection pool with the data source name pointing to the Hive ODBC connection from Step 2, and set the call interface to 'ODBC 2.0' or 'ODBC 3.5'. The call interface should not be 'Apache Hadoop' (you've already specified the database type as Apache Hadoop!). If you set the call interface to 'Apache Hadoop', you will receive an error.
You should now be able to import your tables and columns just like you would with any other connection pool. The BI Server will generate normal SQL statements as if it were querying a traditional Oracle database, but the Hive ODBC driver in turn converts them to HQL, which is executed as MapReduce jobs to query the Hadoop distributed file system across multiple nodes.
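As a rough illustration (the table and column names are hypothetical), a request built against an imported Hive table might ultimately issue a query like the one below, which Hive executes as MapReduce jobs across the cluster:

-- Hypothetical physical query against an imported Hive table;
-- Hive runs this as MapReduce jobs across the cluster's nodes.
SELECT customer_region,
       SUM(order_amount) AS revenue
FROM sales_orders
GROUP BY customer_region;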