Some of
the most frequent modeling challenges in a Higher Ed Data Warehouse has to do
with multivalued dimensions and multivalued attributes. The truth is these are not unique to Higher Ed – they occur in every industry and virtually every
data mart I’ve ever worked on. The use cases and solutions are also,
technically, industry agnostic. What’s interesting in HE is just how pervasive this is in the
Ellucian (Datatel) Colleague student system. But again, the modeling techniques
are all well documented and fairly common place for a data architect. What I find a bit more
challenging is how we represent this at the BI layer and educate our users on the
various approaches, their benefits, and their risks.
Let’s
start with a simple example report where we want to see ‘Total Financial Aid
Amt’ by Student.
Table 1:
Name
|
FinancialAidAmt
|
Catherine
|
$10,000
|
Total
|
$10,000
|
Ok, as
simple as it gets, right?
What
happens if we now want to include a student (repeating, multivalued) attribute, perhaps Major? (If we visualize the a potential star schema it may have a Student Dimension and Financial Aid Fact table. If the Student Dim is at the grain of 1 record per student we then have a nice one-to-many relationship between the Dimension and Fact table. If the Student Dimension is not at that grain, or perhaps joins to an outrigger table, and contains repeating rows for each student to accomodate multiple majors, then we now have a many-to-many from Dimension to Fact).
For example:
For example:
Table 2:
Name
|
Major
|
FinancialAidAmt
|
Catherine
|
English
|
$10,000
|
Catherine
|
History
|
$10,000
|
Total
|
$20,000
|
By simply
introducing a multivaued attribute of student we experience a form of ‘fact explosion’ – incorrectly
doubling of our Aid Amt measure – and a major reporting challenge. Our
‘Financial Aid Amt’ is not at the grain of student major (ie can’t break down
Aid by major, it’s only at the Student, or Person, level).
Aside
from the way I have aggregated the Total in Table 2, there is nothing inherently
wrong with these data. There are few absolutes in modeling. If asked how one should handle such a scenario the best answer is always the same, “It
depends”. Too often we look for the a cookie cutter solution, the “Best
Practice”, but what’s more important is understanding what the goals and
requirements are and having the tools, techniques, and knowledge base to apply
the correct solution to fit the problem.
At a
recent TDWI Conference I attended an advanced modeling session given by Chris
Adamson (http://www.oaktonsoftware.com/) and he said something I thought was
very well stated. He surveyed the audience and asked how many would consider
themselves ETL developers, Business Analysts, DBAs, BI Developers, etc. He
intentionally omitted a call for ‘Data Architects’ (DA) stating that one of the
primary roles of the DA was to balance the shifting burdens of various
technical challenges to the right layer (paraphrasing).
In the
above example, there are techniques available to us at both the BI and data
layers that we can use to fit the use case/s in our environments. Perhaps we’ll
need to utilize several as there is rarely if ever only a single use case to
solve for.
I’ll
discuss the options in a future blog post.