Monday, February 25, 2013

Fact Explosion Part 1


Some of the most frequent modeling challenges in a Higher Ed Data Warehouse has to do with multivalued dimensions and multivalued attributes. The truth is these are not unique to Higher Ed – they occur in every industry and virtually every data mart I’ve ever worked on. The use cases and solutions are also, technically, industry agnostic. What’s interesting in HE is just how pervasive this is in the Ellucian (Datatel) Colleague student system. But again, the modeling techniques are all well documented and fairly common place for a data architect. What I find a bit more challenging is how we represent this at the BI layer and educate our users on the various approaches, their benefits, and their risks.

Let’s start with a simple example report where we want to see ‘Total Financial Aid Amt’ by Student.

Table 1:
Name
FinancialAidAmt
Catherine
$10,000
Total
$10,000

Ok, as simple as it gets, right? 

What happens if we now want to include a student (repeating, multivalued) attribute, perhaps Major? (If we visualize the a potential star schema it may have a Student Dimension and Financial Aid Fact table. If the Student Dim is at the grain of 1 record per student we then have a nice one-to-many relationship between the Dimension and Fact table. If the Student Dimension is not at that grain, or perhaps joins to an outrigger table, and contains repeating rows for each student to accomodate multiple majors, then we now have a many-to-many from Dimension to Fact).

For example:

Table 2:
Name
Major
FinancialAidAmt
Catherine
English
$10,000
Catherine
History
$10,000
Total

$20,000


By simply introducing a multivaued attribute of student we experience a form of ‘fact explosion’ – incorrectly doubling of our Aid Amt measure – and a major reporting challenge. Our ‘Financial Aid Amt’ is not at the grain of student major (ie can’t break down Aid by major, it’s only at the Student, or Person, level).

Aside from the way I have aggregated the Total in Table 2, there is nothing inherently wrong with these data. There are few absolutes in modeling. If asked how one should handle such a scenario the best answer is always the same, “It depends”. Too often we look for the a cookie cutter solution, the “Best Practice”, but what’s more important is understanding what the goals and requirements are and having the tools, techniques, and knowledge base to apply the correct solution to fit the problem.

At a recent TDWI Conference I attended an advanced modeling session given by Chris Adamson (http://www.oaktonsoftware.com/) and he said something I thought was very well stated. He surveyed the audience and asked how many would consider themselves ETL developers, Business Analysts, DBAs, BI Developers, etc. He intentionally omitted a call for ‘Data Architects’ (DA) stating that one of the primary roles of the DA was to balance the shifting burdens of various technical challenges to the right layer (paraphrasing).

In the above example, there are techniques available to us at both the BI and data layers that we can use to fit the use case/s in our environments. Perhaps we’ll need to utilize several as there is rarely if ever only a single use case to solve for.

I’ll discuss the options in a future blog post.

No comments:

Post a Comment