Use of OBIEE Stitch Joins
The general approach when developing with OBI is to push as much processing work to the database as possible. The goal is to let the database engine handle the work within itself, minimizing the performance and latency penalties of moving data across the network. In general this is sound advice. However, there is a point where that philosophy becomes overkill.
When OBI generates SQL, it has a number of options available, as defined in the Features tab of your database object in the RPD.
The specific scenario I'm going to discuss is a conformed query. A conformed query is simply the merging of two separate queries along a shared dimension. A few options in the Features tab have a large degree of influence over how OBI handles a conformed query.
The Scenario
Consider the following simple conformed data model: a shared Driver dimension (supplying Person.Name) joined to two fact sources, LTS1 and LTS2. LTS stands for Logical Table Source; forget about Logical Tables, they don't matter here.
Then consider the following simple query from Answers:

SELECT Person.Name, LTS1.Metric1, LTS2.Metric2;
OBI will need to fetch two result sets and then merge them together:
#1:
SELECT Person.Name, SUM(LTS1.Metric1)
FROM Driver, LTS1
GROUP BY Person.Name;

#2:
SELECT Person.Name, SUM(LTS2.Metric2)
FROM Driver, LTS2
GROUP BY Person.Name;
Then the results of each query have to be merged together. It's not quite as simple as you might first expect, especially if there are more conformed fields than just Person.Name. The tricky part deals with null values and result sets that do not line up row for row, which OBI assumes is always the case:
Example Query Fragment Result Sets:

Query #1        Query #2
--------        --------
Ann,   100      Bob,   1000
Bob,   130      Chuck, 1010
Chuck, 150      Dave,  1200
As you can see, the result sets are
different, so simply joining on the common field of Person.Name is inadequate –
some additional wrapper logic is needed to produce the following results:
Person.Name   Metric1   Metric2
-----------   -------   -------
Ann           100
Bob           130       1000
Chuck         150       1010
Dave                    1200
What needs to occur is a FULL OUTER JOIN to stitch the two together. And OUTER JOINs, let alone FULL OUTER JOINs, are CPU-intensive operations. If there is more than one conformed field, the merging logic gets even more complicated, as combinations of values and nulls need to be considered. More on that later.
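To make the wrapper logic concrete, here is a minimal sketch of the stitch pattern using the model's names. This is a hedged illustration, not OBI's literal output (real generated SQL appears later in this article):

-- Minimal sketch of the stitch join over the two aggregated branches.
-- The CASE keeps the conformed key when either side is missing a row.
select case when d1.name is not null then d1.name
            else d2.name end as person_name,
       d1.metric1,
       d2.metric2
from (select Person.Name as name, sum(LTS1.Metric1) as metric1
      from Driver, LTS1 group by Person.Name) d1
     full outer join
     (select Person.Name as name, sum(LTS2.Metric2) as metric2
      from Driver, LTS2 group by Person.Name) d2
     on d1.name = d2.name;

Run against the sample data above, this yields all four rows, with Metric2 null for Ann and Metric1 null for Dave.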
OBI’s SQL Generation Choices
The two base inner queries will happen no matter how you configure OBI; they are the data retrieval part of the problem, and you have to get the data out of the raw tables somehow. Sure, there are some variations and oddities (like the WITH syntax), but basically those two queries have to be fired no matter what.
The merge/stitch/join portion, however, is flexible, and is ultimately what this article is about. OBI has two choices:
a) The database can do the merge
b) OBI can do the merge
There are two database features which control where the merge will occur: ROWNUM_SUPPORTED and PERF_PREFER_INTERNAL_STITCH_JOIN. On many databases, OBI defaults ROWNUM_SUPPORTED to True and PERF_PREFER_INTERNAL_STITCH_JOIN to False. Let's take a look at the effect of these two options:
     ROWNUM_SUPPORTED   PERF_PREFER_INTERNAL_STITCH_JOIN   Join Location
1    Yes                No                                 Database
2    Yes                Yes                                OBI
3    No                 No                                 OBI
4    No                 Yes                                OBI
When OBI does the stitching, it simply fires off multiple simple queries to the database, waits for each to come back, and then does the merging itself:
- Database Load Reduced
- Network Load Increased
- OBI Load Increased
When the database does the stitching, the two base queries are wrapped together in one monster SQL statement, with a FULL OUTER JOIN linking them and a ton of additional value-checking SQL of the form:

case when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end as c1,
- Database CPU Load Increased
- Network Load Reduced
- OBI Load Reduced
Let's look at some examples to show more specifically what I'm talking about.
Here is the first option, where the database does the join. These are real queries from my current system. Don't focus on the internals of the two inner query blocks – that core data access SQL is basically the same regardless of how you glue the results together. It's the outer stitching wrapper that is important here. Note this is a 3-column select with a few filters on it and a strange piece that selects from DUAL (ignore it):
select D1.c1 as c1,
       D1.c2 as c2,
       D1.c3 as c3
from (select D1.c1 as c1,
             D1.c2 as c2,
             D1.c3 as c3,
             D1.c4 as c4,
             D1.c5 as c5
      from (select case when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end as c1,
                   D2.c1 as c2,
                   case when case when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end = 'Driver'
                        then D1.c1 else D1.c2 end as c3,
                   case when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end as c4,
                   case when D1.c5 is not null then D1.c5 when D2.c4 is not null then D2.c4 end as c5,
                   ROW_NUMBER() OVER (
                       PARTITION BY case when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end,
                                    case when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end,
                                    case when D1.c5 is not null then D1.c5 when D2.c4 is not null then D2.c4 end
                       ORDER BY case when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end ASC,
                                case when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end ASC,
                                case when D1.c5 is not null then D1.c5 when D2.c4 is not null then D2.c4 end ASC) as c6
            from (select D1.c1 as c1,
                         D1.c2 as c2,
                         D1.c3 as c3,
                         D1.c4 as c4,
                         D1.c5 as c5
                  from (select count(distinct T10979.DRIVER_WID) as c1,
                               count(distinct 0) as c2,
                               T13395.ENTITY_TYPE as c3,
                               TRUNC(T804.DAY_DATE) as c4,
                               T804.DAY_WID as c5,
                               ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE
                                                  ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c6
                        from (select 'Vehicle' as ENTITY_TYPE from dual union select 'Driver' from dual) T13395,
                             W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
                             W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
                             W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
                             W_DRIVER_DAY_AGG T10979 /* Facts: Measure DayDrv (W_DRIVER_DAY_AGG) */
                        where ( T804.DAY_WID = T10979.DATE_WID
                                and T806.CUSTOMER_WID = T10979.CUSTOMER_WID
                                and T806.CUSTOMER_WID = 188111.0
                                and T867.MEASURE_TYPE_WID = T10979.MEASURE_TYPE_WID
                                and T10979.CUSTOMER_WID = 188111.0
                                and T13395.ENTITY_TYPE = 'Driver'
                                and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS')
                                                      and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
                        group by T804.DAY_WID, T13395.ENTITY_TYPE, TRUNC(T804.DAY_DATE)
                       ) D1
                  where ( D1.c6 = 1 )
                 ) D1
                 full outer join
                 (select D1.c1 as c1,
                         D1.c2 as c2,
                         D1.c3 as c3,
                         D1.c4 as c4
                  from (select sum(T1021.MEASURE_COUNT) as c1,
                               TRUNC(T804.DAY_DATE) as c2,
                               T13395.ENTITY_TYPE as c3,
                               T804.DAY_WID as c4,
                               ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE
                                                  ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c5
                        from (select 'Vehicle' as ENTITY_TYPE from dual union select 'Driver' from dual) T13395,
                             W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
                             W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
                             W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
                             W_FLEET_AGG T1021 /* Facts: Measure Fleet (W_FLEET_AGG) */
                        where ( T804.DAY_WID = T1021.DATE_WID
                                and T806.CUSTOMER_WID = T1021.CUSTOMER_WID
                                and T806.CUSTOMER_WID = 188111.0
                                and T867.MEASURE_TYPE_WID = T1021.MEASURE_TYPE_WID
                                and T1021.CUSTOMER_WID = 188111.0
                                and T13395.ENTITY_TYPE = 'Driver'
                                and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS')
                                                      and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
                        group by T804.DAY_WID, T13395.ENTITY_TYPE, TRUNC(T804.DAY_DATE)
                       ) D1
                  where ( D1.c5 = 1 )
                 ) D2 on D1.c3 = D2.c3 and D1.c5 = D2.c4
           ) D1
      where ( D1.c6 = 1 )
     ) D1
order by c1
Now with PERF_PREFER_INTERNAL_STITCH_JOIN = TRUE:
select D1.c4 as c1,
       D1.c2 as c2,
       D1.c3 as c3,
       D1.c1 as c4,
       D1.c5 as c5
from (select D1.c1 as c1,
             D1.c2 as c2,
             D1.c3 as c3,
             D1.c4 as c4,
             D1.c5 as c5
      from (select count(distinct T10979.DRIVER_WID) as c1,
                   count(distinct 0) as c2,
                   T13395.ENTITY_TYPE as c3,
                   TRUNC(T804.DAY_DATE) as c4,
                   T804.DAY_WID as c5,
                   ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE
                                      ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c6
            from (select 'Vehicle' as ENTITY_TYPE from dual union select 'Driver' from dual) T13395,
                 W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
                 W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
                 W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
                 W_DRIVER_DAY_AGG T10979 /* Facts: Measure DayDrv (W_DRIVER_DAY_AGG) */
            where ( T804.DAY_WID = T10979.DATE_WID
                    and T806.CUSTOMER_WID = T10979.CUSTOMER_WID
                    and T806.CUSTOMER_WID = 188111.0
                    and T867.MEASURE_TYPE_WID = T10979.MEASURE_TYPE_WID
                    and T10979.CUSTOMER_WID = 188111.0
                    and T13395.ENTITY_TYPE = 'Driver'
                    and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS')
                                          and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
            group by T804.DAY_WID, T13395.ENTITY_TYPE, TRUNC(T804.DAY_DATE)
           ) D1
      where ( D1.c6 = 1 )
     ) D1
order by c3, c5;
——————————————————————————————-
select D2.c2 as c1,
       D2.c1 as c2,
       D2.c3 as c3,
       D2.c4 as c4
from (select D1.c1 as c1,
             D1.c2 as c2,
             D1.c3 as c3,
             D1.c4 as c4
      from (select sum(T1021.MEASURE_COUNT) as c1,
                   TRUNC(T804.DAY_DATE) as c2,
                   T13395.ENTITY_TYPE as c3,
                   T804.DAY_WID as c4,
                   ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE
                                      ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c5
            from (select 'Vehicle' as ENTITY_TYPE from dual union select 'Driver' from dual) T13395,
                 W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
                 W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
                 W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
                 W_FLEET_AGG T1021 /* Facts: Measure Fleet (W_FLEET_AGG) */
            where ( T804.DAY_WID = T1021.DATE_WID
                    and T806.CUSTOMER_WID = T1021.CUSTOMER_WID
                    and T806.CUSTOMER_WID = 188111.0
                    and T867.MEASURE_TYPE_WID = T1021.MEASURE_TYPE_WID
                    and T1021.CUSTOMER_WID = 188111.0
                    and T13395.ENTITY_TYPE = 'Driver'
                    and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS')
                                          and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
            group by T804.DAY_WID, T13395.ENTITY_TYPE, TRUNC(T804.DAY_DATE)
           ) D1
      where ( D1.c5 = 1 )
     ) D2
order by c3, c4;
As you can see, the heart of the queries is identical; the difference is only in where the merging occurs.
One side note: some databases do not like the ROW_NUMBER() function, so check whether yours does. In my experience, MPP databases such as Teradata perform this operation very slowly. I've also seen conflicting information from authors on whether ROW_NUMBER is good or bad for performance.
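For reference, here is the ROW_NUMBER() pattern isolated from the queries above (a simplified extract, with the same generated-style aliases). Note that because each branch already groups by the partition columns, every partition contains exactly one row, so the filter acts purely as a uniqueness guarantee:

-- Simplified extract of the de-duplication wrapper OBI adds
-- when ROWNUM_SUPPORTED = Yes.
select D1.c1, D1.c2
from (select T804.DAY_WID as c1,
             sum(T1021.MEASURE_COUNT) as c2,
             ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID
                                ORDER BY T804.DAY_WID ASC) as c6
      from W_DAY_DIM T804, W_FLEET_AGG T1021
      where T804.DAY_WID = T1021.DATE_WID
      group by T804.DAY_WID) D1
where D1.c6 = 1;  -- one row per partition; redundant after the GROUP BY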
Without the ROWNUM_SUPPORTED
setting, the queries are even simpler:
select count(distinct T10979.DRIVER_WID) as c1,
       count(distinct 0) as c2,
       T13395.ENTITY_TYPE as c3,
       TRUNC(T804.DAY_DATE) as c4,
       T804.DAY_WID as c5
from (select 'Vehicle' as ENTITY_TYPE from dual union select 'Driver' from dual) T13395,
     W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
     W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
     W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
     W_DRIVER_DAY_AGG T10979 /* Facts: Measure DayDrv (W_DRIVER_DAY_AGG) */
where ( T804.DAY_WID = T10979.DATE_WID
        and T806.CUSTOMER_WID = T10979.CUSTOMER_WID
        and T806.CUSTOMER_WID = 188111.0
        and T867.MEASURE_TYPE_WID = T10979.MEASURE_TYPE_WID
        and T10979.CUSTOMER_WID = 188111.0
        and T13395.ENTITY_TYPE = 'Driver'
        and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS')
                              and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE, TRUNC(T804.DAY_DATE)
order by c5, c3
+++dev2:2a0000:2a0001:----2010/11/03 06:08:29
-------------------- Sending query to database named AM Data Mart (id: <<49984>>):
select sum(T1021.MEASURE_COUNT) as c1,
       TRUNC(T804.DAY_DATE) as c2,
       T13395.ENTITY_TYPE as c3,
       T804.DAY_WID as c4
from (select 'Vehicle' as ENTITY_TYPE from dual union select 'Driver' from dual) T13395,
     W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
     W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
     W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
     W_FLEET_AGG T1021 /* Facts: Measure Fleet (W_FLEET_AGG) */
where ( T804.DAY_WID = T1021.DATE_WID
        and T806.CUSTOMER_WID = T1021.CUSTOMER_WID
        and T806.CUSTOMER_WID = 188111.0
        and T867.MEASURE_TYPE_WID = T1021.MEASURE_TYPE_WID
        and T1021.CUSTOMER_WID = 188111.0
        and T13395.ENTITY_TYPE = 'Driver'
        and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS')
                              and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE, TRUNC(T804.DAY_DATE)
order by c4, c3
To Internally Stitch or Not?
You're going to hate me for this, but there is no single answer; it depends on a variety of factors. Personally I prefer the OBI stitch join: the simpler SQL it sends to the database is far easier to read during development. QA'ing the monster database-stitch queries is akin to the crew in The Matrix reading the black and green screens and making perfect sense of them – not for the inexperienced. I am a firm believer that if the SQL is not perfect, it's wrong, so for debugging and tuning the SQL, OBI stitching is the way to go. But of course what a person prefers is not always what the technology prefers.
When I say perfect, I mean that the proper function shipping and grouping is being pushed down to the database. If these are incorrect, you may end up with a query returning far too many records to OBI across the network. The difference can be huge: in some cases your inner query may return a million rows or more across the network, for OBI to then write to disk and run its MAX() computation on.
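As an illustration of what "perfect" means here, compare the two sketches below (the fact table and columns are hypothetical, not from the system above). In the first, the aggregation is shipped to the database and only the grouped rows cross the network; in the second, OBI pulls the detail rows and aggregates them itself:

-- Function shipping working: the database aggregates, few rows cross the wire.
select person_name, max(metric1) as c1
from   fact_detail          -- hypothetical fact table
group  by person_name;

-- Function shipping broken: every detail row crosses the wire,
-- and OBI computes the MAX() internally after spooling rows to disk.
select person_name, metric1
from   fact_detail;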
With a large recordset this is a deal breaker; with a small recordset you will be OK. This is why it is imperative to know what your queries are doing and how many rows they return, which the OBI log file tells you:
-------------------- Rows 15, bytes 1680 retrieved from database query id: <<49984>> -- Good
-------------------- Rows 150000, bytes 16800000 retrieved from database query id: <<49984>> -- Bad
OK, so now I'm assuming you have done your RPD work and the correct core SQL is being generated. Now it's on to determining where the stitch should occur.
Recently I had the opportunity to do some load testing on our pre-production system. We made three observations that are specific to our system (you may or may not see the same ones):
- The database CPU was the bottleneck, doing nested loops to process the stitching
- The inner SQL all returned only a few rows per query, usually under 10
- We had a tremendous amount of unused OBI Server CPU power
With these inputs, we switched over to letting OBI do the stitching. The results were dramatic:
- At 100 concurrent users, the DB load factor went from 75 to 9 (16 CPU cores)
- Avg. report response time improved 9x. Remember, this is at load, not a single query on an idle box
- OBI Server load maxed out at 1% across 3 nodes even with stitching
These results speak for themselves: a substantial improvement from changing a single checkbox!
Be Careful Of
There are a few things to be careful of when you switch from one technique to the other – the direction doesn't matter much:
- Make sure your inner SQL is perfect – that is a prerequisite for this entire topic
- Regression test everything when you make any SQL generation switches. I have seen differences appear going from OBI to DB stitch and also the reverse. I cannot stress this enough.
- When resolving these differences, be aware that OBI stitching is a little stricter than DB stitching. Example: after letting OBI do the stitching, we tracked a discrepancy down to an incorrect mid-level hierarchy key. We had defined the level key as State, and the Oracle DB was fine with it, but OBI wasn't – it wanted the level key to be Country & State. Expect to find some difficult RPD configuration issues to resolve.
- Nested Aggregations force your hand somewhat, in that they frequently require ROW_NUMBER support in order to generate multi-level SQL (i.e., MAX(SUM())), as sketched below. Without it, the database will send the results of the inner aggregation across the network to OBI – this could be millions of records. If this is the case and you want OBI to do the joins, make sure that PERF_PREFER_INTERNAL_STITCH_JOIN is set to Yes.
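To illustrate the nested-aggregation point from the last bullet, here is a sketch (names are illustrative, not from the system above) of the multi-level SQL that ROW_NUMBER support allows OBI to ship as a single statement:

-- MAX(SUM(metric)) shipped as one statement when multi-level SQL is allowed.
select person_name,
       max(day_total) as max_daily_total
from (select person_name,
             day_wid,
             sum(metric1) as day_total    -- inner aggregation
      from fact_detail                    -- hypothetical fact table
      group by person_name, day_wid) d
group by person_name;
-- Without this, only the inner SELECT is pushed to the database, and the
-- outer MAX()/GROUP BY runs inside OBI on whatever rows come back.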
Conclusion
For many systems, where the stitching occurs will not be a problem. It depends on the complexity of your queries, the hardware you have available, where your bottlenecks are, the planned load, and even the metrics themselves.
Furthermore, you may not see the benefit of going one way or the other until the system is at full user load, when resource capacities and limits are hit and bottlenecks appear. I strongly recommend not only load testing, but comparing both options at load to identify which is best.
Of course before you do this it is
imperative that you regression test to make sure you get the same results – do
not make the mistake of assuming such a small configuration change can’t have
large impacts.