Wednesday, March 13, 2013

Survey Data and the Warehouse

In a perfect world, survey methodology would be consistent over time: the same questions, administration method, sample sizes, and target population. Most of us don't live in that world. A core survey, perhaps a New Student Survey, may be administered yearly, but over time questions inevitably get modified, new ones are added, some are removed, and the way it's administered changes. From one perspective, evolving improvements are a good thing; if we can make something better, why wouldn't we? But aside from the challenges this presents for longitudinal statistical analysis, it also presents challenges for the Data Warehouse, where there is increasing demand to include these survey data and marry them with our more robust Dimensions, providing the opportunity for a much richer, broader, and deeper set of questions to be answered.

You'll likely not receive a warm response from any Data Architect or ETL developer when they hear that the survey table can change each year (number of columns, column names, response values, data types). This forces them to change table DDL, rebuild data structures, update the model, and modify ETL packages. So, what to do? I've addressed this three different ways, and I'm sure there are others.

(note: the diagrams below are really just conceptual in nature).

1. Traditional Dimensional Model Approach
Earlier in my career I felt compelled to pray at the altar of "Best Practices" and would incorporate survey data into a traditional Star Schema design. Now I'm more comfortable bending those rules when it makes sense. A traditional approach works well if the incoming data are consistent and in a standard, predictable format. When questions change from year to year, for example, this model really becomes challenged.
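
To make the challenge concrete, here is a minimal sketch of the traditional design, with purely illustrative table and column names: one column per question, keyed to conformed dimensions.

-- Traditional wide design: one column per survey question (names are illustrative).
CREATE TABLE fact_new_student_survey (
    student_key      INT NOT NULL REFERENCES dim_student (student_key),
    term_key         INT NOT NULL REFERENCES dim_term (term_key),
    q1_satisfaction  SMALLINT,        -- e.g. a 1-5 Likert response
    q2_advising      SMALLINT,
    q3_housing       SMALLINT,
    q4_comments      VARCHAR(4000)    -- free-text response
);

-- Next year's reworded, added, or dropped question means ALTER TABLE,
-- model updates, and ETL changes, e.g.:
-- ALTER TABLE fact_new_student_survey ADD q5_orientation SMALLINT;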




2. Multiple "Fact" Table Approach
This approach essentially addresses the shortcomings of the single fact table design (see approach #1) by creating a new data structure for each incoming survey while maintaining the ability to join to some of our conformed dimensions. If you're using Business Objects, of course, you're now going to have to deal with a Universe with multiple contexts.

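As a rough sketch (table and question names are invented for illustration), each administration gets its own structure while sharing the conformed dimension keys:

CREATE TABLE fact_new_student_survey_2012 (
    student_key      INT NOT NULL REFERENCES dim_student (student_key),
    term_key         INT NOT NULL REFERENCES dim_term (term_key),
    q1_satisfaction  SMALLINT,
    q2_advising      SMALLINT
);

CREATE TABLE fact_new_student_survey_2013 (
    student_key          INT NOT NULL REFERENCES dim_student (student_key),
    term_key             INT NOT NULL REFERENCES dim_term (term_key),
    q1_satisfaction      SMALLINT,
    q2_advising_revised  SMALLINT,   -- reworded question gets its own column
    q3_financial_aid     SMALLINT    -- brand new question this year
);

-- Each table joins to dim_student and dim_term independently, which is what
-- produces the multiple contexts in a Business Objects Universe.
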
3. Entity-Attribute-Value (EAV) Approach
An Attribute-Value Pair design can be appealing to the ETL developer since it is built to accommodate any number of changes in questions and response values: instead of being stored as columns, they are pivoted and become values themselves in two simple data elements, Survey Question and Survey Response. This design can now absorb a variety of survey changes and even entirely different surveys.

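A minimal sketch of such a structure, again with illustrative names:

CREATE TABLE fact_survey_response (
    student_key      INT NOT NULL REFERENCES dim_student (student_key),
    term_key         INT NOT NULL REFERENCES dim_term (term_key),
    survey_name      VARCHAR(100),    -- e.g. 'New Student Survey 2013'
    survey_question  VARCHAR(500),    -- question code or text
    survey_response  VARCHAR(4000)    -- stored as text so it can hold any response type
);

-- New questions, or entirely new surveys, become new rows; no DDL changes required.
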
While this design is appealing for some use cases, it's also a bit of robbing Peter to pay Paul, since it pushes additional complexity to the report developer. Data in this format feel unnatural to many users, who often end up spending their time trying to "re-pivot" the data into a traditional row-column format. Also, the column holding all the question responses must inevitably be formatted as text (varchar, long, etc.), which inhibits the behavior of any numeric response values (aggregations, custom functions) and requires more lifting at the BI layer. Sometimes this design is appropriate in the backend staging area in order to load changing datasets, but we're still left with some of the above challenges, such as having to completely drop and rebuild table structures and make updates at the BI layer.
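
For what it's worth, the re-pivot against the hypothetical table above usually looks something like this, with explicit casts needed before any numeric work:

SELECT
    student_key,
    MAX(CASE WHEN survey_question = 'Q1_SATISFACTION'
             THEN CAST(survey_response AS INT) END) AS q1_satisfaction,
    MAX(CASE WHEN survey_question = 'Q2_ADVISING'
             THEN CAST(survey_response AS INT) END) AS q2_advising
FROM fact_survey_response
WHERE survey_name = 'New Student Survey 2013'
GROUP BY student_key;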

If I could find a good way to use the EAV approach with BO Webi, I would be more inclined to implement it. I'm curious to see whether some other tools we're looking at can more easily accommodate the need for a re-pivot.



Tuesday, March 12, 2013

Fact Explosion Part 2


In the last post, Fact Explosion Part 1, I presented the problem of delivering fact (measure) information that is at a different grain than the surrounding dimensions due to things such as multivalued dimensions or repeating attributes. Although the two scenarios are different by definition from a modeling perspective, they often present the same front-end reporting challenges.

Normally we aim for a nice one-to-many relationship between our dimension and fact tables. Let's say we have a Course Enrollment Fact table at the grain of one row per student, course, and term. Of course we inevitably want to know who taught the course (instructor), but each Course Section can have more than one instructor, so if we were to join an Instructor Dimension directly to the fact table it would be a many-to-many relationship. Or perhaps our Student Dimension must accommodate the fact that a student can, and often does, have a double major or more, resulting in a multivalued, repeating-attribute design of our dimension.

So what are our options?

1.    Don’t allow the scenario to occur
This may be the best option for the masses. Create a "view" or Universe or semantic layer that keeps everything in check so the average user doesn't report out incorrect numbers. Of course, this is not always possible…
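
As a rough sketch with made-up object names, the idea is simply to expose the measure only at its natural grain:

-- Aid stays at the student grain; majors are deliberately excluded from this view.
CREATE VIEW v_student_financial_aid AS
SELECT s.student_key,
       s.student_name,
       f.financial_aid_amt
FROM   dim_student s
JOIN   fact_financial_aid f
       ON f.student_key = s.student_key;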

Pros: Prevents incorrect aggregations and can be delivered to a larger audience without fear of incorrect reporting

Cons: Doesn’t always address real world reporting needs


2.    Deliver as is and push the challenge to the report developer
This option should only be considered when working with very experienced data and report writers; otherwise, incorrect data aggregations could result.


Table 1: Incorrect Aggregation

Name                 Major      FinancialAidAmt
Catherine            English    $10,000
Catherine            History    $10,000
Catherine Subtotal              $20,000
Joe                  Poli Sci   $25,000
Joe Subtotal                    $25,000
Grand Total                     $45,000

Again, Catherine received $10,000 of Financial Aid in total and is a double major. The aid amount is not at the major grain, so simply using a basic SUM aggregation may make it appear that she has $20,000 in aid. To prevent this we often need to rely on custom functions available in our BI tools: partitioning by student, breaking on student, sectioning by student, creative use of AVG and SUM across rows, and in BO functions such as ForEach, ForAll, and In, all to prevent incorrect aggregation.
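
The SQL equivalent of those report-level tricks, written against hypothetical tables, is to collapse back to one aid amount per student before summing:

SELECT SUM(financial_aid_amt) AS total_aid
FROM (
    -- one row per student, even though the join to majors repeats the fact row
    -- (assumes a single aid amount per student)
    SELECT DISTINCT sm.student_key, f.financial_aid_amt
    FROM   dim_student_major sm
    JOIN   fact_financial_aid f
           ON f.student_key = sm.student_key
) one_row_per_student;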

Table 2: Correct Aggregation

Name                 Major      FinancialAidAmt
Catherine            English    $10,000
Catherine            History    $10,000
Catherine Subtotal              $10,000
Joe                  Poli Sci   $25,000
Joe Subtotal                    $25,000
Grand Total                     $35,000


Pros: Users have more data elements to explore against various metrics

Cons: Greater likelihood of incorrect reporting

3.    Rows to Columns – Flatten the repeating attributes
This is a classic approach that works well in the right circumstances, especially considering its limitations. It works well when there is a relatively small and known number of possible values to pivot. For example, if our institution limits areas of study to a maximum of two majors, we have a small and known number of rows to pivot into column format. It breaks down when the number of values in a particular domain is large and highly variable. This method also presents other challenges, such as easily filtering all students of a particular major, since the majors are now spread across multiple columns. Not impossible to solve, but more complicated.
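
One way to produce that flattened structure, sketched against a hypothetical dim_student_major table and assuming the two-major cap:

SELECT student_key,
       MAX(CASE WHEN major_rank = 1 THEN major END) AS major1,
       MAX(CASE WHEN major_rank = 2 THEN major END) AS major2
FROM (
    SELECT student_key,
           major,
           ROW_NUMBER() OVER (PARTITION BY student_key
                              ORDER BY major) AS major_rank
    FROM dim_student_major
) ranked
GROUP BY student_key;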

Table 3: Flattened Majors (Rows to Columns)

Name                 Major1     Major2     FinancialAidAmt
Catherine            English    History    $10,000
Catherine Subtotal                         $10,000
Joe                  Poli Sci   NA         $25,000
Joe Subtotal                               $25,000
Grand Total                                $35,000


Pros: Flattens the student dimensional attributes into a single row and decreases the likelihood of incorrect aggregations at the reporting layer

Cons: Introduces difficulty in exploring data by major


4.    Concatenation of multivalued attributes
This is similar to our previous solution but it combines both major values into a single column rather than two.
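
A sketch of the concatenation step (LISTAGG here; STRING_AGG or GROUP_CONCAT in other engines, and the table name is again hypothetical):

SELECT student_key,
       LISTAGG(major, '; ') WITHIN GROUP (ORDER BY major) AS all_majors
FROM   dim_student_major
GROUP  BY student_key;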

Table 4: Concatenated Majors

Name                 AllMajors          FinancialAidAmt
Catherine            English; History   $10,000
Catherine Subtotal                      $10,000
Joe                  Poli Sci           $25,000
Joe Subtotal                            $25,000
Grand Total                             $35,000


Pros: Similar to the previous example in that it flattens the student attributes into a single row, minimizing fact explosion, while also retaining the major information in a single column that lends itself to wildcard filtering

Cons: Limits analysis by major due to concatenation


5.    Allocation
We can always allocate the total aid amount per student across the appropriate number of rows. This is a modeling technique that uses a weighting factor to determine the allocation percentage. For example, where Catherine has two rows (majors) and a total of $10,000 in aid, we assign each of the two rows a value of $5,000.
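
In SQL terms, again with hypothetical table names, the 1/N weighting can be computed at query time like this, though in practice it is often done during ETL:

SELECT sm.student_key,
       sm.major,
       f.financial_aid_amt * 1.0
         / COUNT(*) OVER (PARTITION BY sm.student_key) AS allocated_aid_amt
FROM   dim_student_major sm
JOIN   fact_financial_aid f
       ON f.student_key = sm.student_key;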

Table 5: Allocated Aid Amounts

Name                 Major      FinancialAidAmt
Catherine            English    $5,000
Catherine            History    $5,000
Catherine Subtotal              $10,000
Joe                  Poli Sci   $25,000
Joe Subtotal                    $25,000
Grand Total                     $35,000

Pros: Addresses the fact explosion issue by allocating the totals across rows, so we can continue to use basic SUM functions across rows.

Cons: This can be confusing to some users and also becomes more expensive at the ETL layer, and possibly at run time, depending on when and where the allocation calculations take place


So those are a few options for addressing these types of data challenges. Many require work from the Data Architect and ETL developers via SQL: creating appropriate data structures and processing, adjusting the data models, building bridge tables, etc.
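
For example, one of those supporting structures might be a weighted bridge table between student and major (a sketch with illustrative names), where the weight doubles as the allocation factor from option 5:

CREATE TABLE bridge_student_major (
    student_key INT NOT NULL REFERENCES dim_student (student_key),
    major_key   INT NOT NULL REFERENCES dim_major (major_key),
    weight      DECIMAL(5,4) NOT NULL   -- e.g. 0.5000 for each major of a double major
);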

What's the best option? It depends on many factors: BI tool capabilities, end-user skills, reporting use cases, and requirements. But ultimately I use them all, often in concert with each other, because there are always multiple business requirements that need to be solved for on any given project.