In the
last post, Fact Explosion Part 1, I presented the problem of delivering fact
(measure) information that is at a different grain than the surrounding
dimensions due to such things as – multivalued dimensions or repeating
attributes. Although from a modeling perspective, these two scenarios are
different by definition, they often present the same front end reporting
challenges.
Normally
we aim for a nice one-to-many relationship between our dimension and fact
tables. Let’s say we have a Course
Enrollment Fact table at the grain of 1 row per student, course, term. Of
course we inevitably want to know who taught the course (instructor) but each
Course Section can have more than 1 instructor so if we were to join an
Instructor Dimension directly to the fact table it would be in a many-to-many
relationship. Or, perhaps our Student Dimension must accommodate the fact that
a student can, and often does, have double majors or more, resulting in a multivalued, repeating attribute design of our dimension.
So what
are our options?
1.
Don’t allow the scenario to occur
This may be the best option for the masses. Create a
“view” or Universe or semantic layer that keeps everything in check so the
average user doesn’t report out incorrect numbers. Of course, this is not
always possible….
Pros:
Prevents incorrect aggregations and can be delivered to a larger audience
without fear of incorrect reporting
Cons: Doesn’t
always address real world reporting needs
2.
Deliver as is and push the
challenge to the report developer
This option should only be considered if working with very
experienced data and report writers, otherwise incorrect data aggregations could
result.
Table 1: Incorrect
Aggregation
Name
|
Major
|
FinancialAidAmt
|
Catherine
|
English
|
$10,000
|
Catherine
|
History
|
$10,000
|
Catherine
Subtotal
|
$20,000
|
|
Joe
|
Poli Sci
|
$25,000
|
Joe
Subtotal
|
$25,000
|
|
Gran
Total
|
$45,000
|
Again, Catherine received $10,000 of Financial Aid total
and is a double major. The Aid amount is not at the major grain. Simply using
basic SUM aggregation may result in it appearing that she has $20,000 in aid.
In order to prevent this we often need to rely on custom functions available to
us in our BI tools – partition by student, break on student, section by
student, creative use of AVG and SUM across rows, in BO such functions as FOR
EACH, FOR ALL, IN, - all to prevent incorrect aggregation.
Table 2: Correct
Aggregation
Name
|
Major
|
FinancialAidAmt
|
Catherine
|
English
|
$10,000
|
Catherine
|
History
|
$10,000
|
Catherine
Subtotal
|
$10,000
|
|
Joe
|
Poli Sci
|
$25,000
|
Joe
Subtotal
|
$25,000
|
|
Gran
Total
|
$35,000
|
Pros:
Users have more data elements to explore against various metrics
Cons: Greater
likelihood of incorrect reporting
3.
Rows to Columns – Flatten the
repeating attributes
This is a classic approach that works well in the right
circumstances especially considering its limitations. When it works OK is when
there are a relatively small and known number of possible values to pivot. For
example, if our institution limits area of study to a max of two majors we have
a small and know number of rows to pivot to column format. When it doesn’t work
is if the number of values in a particular domain is large and hugely variable.
This method also presents other challenges such as easily filtering all
students of a particular major since the majors are now spread across multiple
columns. Not impossible to solve, but more complicated.
Table 3:
Name
|
Major1
|
Major2
|
FinancialAidAmt
|
|
Catherine
|
English
|
History
|
$10,000
|
|
Catherine
Subtotal
|
$10,000
|
|||
Joe
|
Poli Sci
|
NA
|
$25,000
|
|
Joe
Subtotal
|
$25,000
|
|||
Gran
Total
|
$35,000
|
Pros:
Flattens the student dimensional attributes into a single row and decreases
likelihood of incorrect aggregations at the reporting layers
Cons:
Introduces difficulty in exploring data by major
4.
Concatenation of multivalued attributes
This is similar to our previous solution but it combines
both major values into a single column rather than two.
Table 4:
Name
|
AllMajors
|
FinancialAidAmt
|
Catherine
|
English; History
|
$10,000
|
Catherine
Subtotal
|
$10,000
|
|
Joe
|
Poli Sci
|
$5,000
|
Joe
Subtotal
|
$25,000
|
|
Gran
Total
|
$35,000
|
Pros: Similar to our previous example in that it flattens
the student attributes into a single row minimizing fact explosion WHILE ALSO
retaining the major information into a single column perhaps conducive to wild
card filtering
Cons: Limits analysis by major due to concatenation
5.
Allocation
We can always allocate the total aid amount by student
across the appropriate number of rows. This is a modeling technique using a
weighting factor to determine the allocation percentage. For example, where
Catherine has 2 rows (majors) and a total of $10,000 in aid we assign each of
the two rows a value of $5,000.
Table 5:
Name
|
Major
|
FinancialAidAmt
|
Catherine
|
English
|
$5,000
|
Catherine
|
History
|
$5,000
|
Catherine
Subtotal
|
$10,000
|
|
Joe
|
Poli Sci
|
$25,000
|
Joe
Subtotal
|
$5,000
|
|
Gran
Total
|
$35,000
|
Pros: Addresses the fact explosion issue by allocating the
totals across rows so we can now continue to use basic SUM functions across
rows.
Cons: This can be
confusing to some users and also becomes more expensive at the ETL layers and
possible at run time depending on when/where the allocation calculations are
taking place
So those are a few options for addressing these types of
data challenges. Many require work from the Data Architect and ETL developers via
SQL, creating appropriate data structures and processing, adjustments to the
data models, bridge tables, etc.
What’s the best option? It depends on many factors – BI tool
capabilities, end user skills, reporting use cases, and requirements. But
ultimately, I use them all and often in concert with each other because there are
always multiple business requirements that need to be solved for on any given project.
No comments:
Post a Comment