Tuesday, March 12, 2013

Fact Explosion Part 2


In the last post, Fact Explosion Part 1, I presented the problem of delivering fact (measure) information that is at a different grain than the surrounding dimensions due to such things as – multivalued dimensions or repeating attributes. Although from a modeling perspective, these two scenarios are different by definition, they often present the same front end reporting challenges.

Normally we aim for a nice one-to-many relationship between our dimension and fact tables.  Let’s say we have a Course Enrollment Fact table at the grain of 1 row per student, course, term. Of course we inevitably want to know who taught the course (instructor) but each Course Section can have more than 1 instructor so if we were to join an Instructor Dimension directly to the fact table it would be in a many-to-many relationship. Or, perhaps our Student Dimension must accommodate the fact that a student can, and often does, have double majors or more, resulting in a multivalued, repeating attribute design of our dimension.

So what are our options?

1.    Don’t allow the scenario to occur
This may be the best option for the masses. Create a “view” or Universe or semantic layer that keeps everything in check so the average user doesn’t report out incorrect numbers. Of course, this is not always possible….

Pros: Prevents incorrect aggregations and can be delivered to a larger audience without fear of incorrect reporting

Cons: Doesn’t always address real world reporting needs


2.    Deliver as is and push the challenge to the report developer
This option should only be considered if working with very experienced data and report writers, otherwise incorrect data aggregations could result.


Table 1: Incorrect Aggregation
Name
Major
FinancialAidAmt
Catherine
English
$10,000
Catherine
History
$10,000
Catherine Subtotal
$20,000
Joe
Poli Sci
$25,000
Joe Subtotal

$25,000
Gran Total

$45,000

Again, Catherine received $10,000 of Financial Aid total and is a double major. The Aid amount is not at the major grain. Simply using basic SUM aggregation may result in it appearing that she has $20,000 in aid. In order to prevent this we often need to rely on custom functions available to us in our BI tools – partition by student, break on student, section by student, creative use of AVG and SUM across rows, in BO such functions as FOR EACH, FOR ALL, IN, - all to prevent incorrect aggregation.

Table 2: Correct Aggregation
Name
Major
FinancialAidAmt
Catherine
English
$10,000
Catherine
History
$10,000
Catherine Subtotal
$10,000
Joe
Poli Sci
$25,000
Joe Subtotal

$25,000
Gran Total

$35,000


Pros: Users have more data elements to explore against various metrics

Cons: Greater likelihood of incorrect reporting

3.    Rows to Columns – Flatten the repeating attributes
This is a classic approach that works well in the right circumstances especially considering its limitations. When it works OK is when there are a relatively small and known number of possible values to pivot. For example, if our institution limits area of study to a max of two majors we have a small and know number of rows to pivot to column format. When it doesn’t work is if the number of values in a particular domain is large and hugely variable. This method also presents other challenges such as easily filtering all students of a particular major since the majors are now spread across multiple columns. Not impossible to solve, but more complicated.

Table 3:
Name
Major1
Major2
FinancialAidAmt
Catherine
English
History
$10,000
Catherine Subtotal

$10,000
Joe
Poli Sci
NA
$25,000
Joe Subtotal


$25,000
Gran Total


$35,000


Pros: Flattens the student dimensional attributes into a single row and decreases likelihood of incorrect aggregations at the reporting layers

Cons: Introduces difficulty in exploring data by major


4.    Concatenation of multivalued attributes
This is similar to our previous solution but it combines both major values into a single column rather than two.

Table 4:
Name
AllMajors
FinancialAidAmt
Catherine
English; History
$10,000
Catherine Subtotal
$10,000
Joe
Poli Sci
$5,000
Joe Subtotal

$25,000
Gran Total

$35,000


Pros: Similar to our previous example in that it flattens the student attributes into a single row minimizing fact explosion WHILE ALSO retaining the major information into a single column perhaps conducive to wild card filtering

Cons: Limits analysis by major due to concatenation


5.    Allocation
We can always allocate the total aid amount by student across the appropriate number of rows. This is a modeling technique using a weighting factor to determine the allocation percentage. For example, where Catherine has 2 rows (majors) and a total of $10,000 in aid we assign each of the two rows a value of $5,000.

Table 5:
Name
Major
FinancialAidAmt
Catherine
English
$5,000
Catherine
History
$5,000
Catherine Subtotal
$10,000
Joe
Poli Sci
$25,000
Joe Subtotal

$5,000
Gran Total

$35,000

Pros: Addresses the fact explosion issue by allocating the totals across rows so we can now continue to use basic SUM functions across rows.

Cons:  This can be confusing to some users and also becomes more expensive at the ETL layers and possible at run time depending on when/where the allocation calculations are taking place


So those are a few options for addressing these types of data challenges. Many require work from the Data Architect and ETL developers via SQL, creating appropriate data structures and processing, adjustments to the data models, bridge tables, etc.

What’s the best option? It depends on many factors – BI tool capabilities, end user skills, reporting use cases, and requirements. But ultimately, I use them all and often in concert with each other because there are always multiple business requirements that need to be solved for on any given project.



No comments:

Post a Comment