Wednesday, March 13, 2013

Survey Data and the Warehouse

In a perfect world, survey data methodology would be consistent over time - the same questions, administration method, sample sizes, target population - but most of us don't live in that world. A core survey, perhaps a New Student Survey, may be administered every year, but over time questions are inevitably modified, new ones added, others removed, and the way it's administered changes. From one perspective, evolving improvements are a good thing: if we can make something better, why wouldn't we? But aside from the challenges this presents for longitudinal statistical analysis, it also presents challenges to the Data Warehouse, because there is increasing demand to bring these survey data into the Warehouse and marry them with our more robust Dimensions, opening up a much richer, broader, and deeper set of questions that can be answered.

You'll likely not receive a warm response from any Data Architect or ETL developer when they hear that the table of survey data may change each year (number of columns, column names, response values, data types). This forces them to change table DDL, rebuild data structures, update the model, and modify ETL packages. So, what to do? I've addressed this three different ways, and I'm sure there are others.

(note: the diagrams below are really just conceptual in nature).

1. Traditional Dimensional Model Approach
Earlier in my career I felt more compelled to pray at the altar of "Best Practices" and would incorporate survey data into a traditional Star Schema design. Now I'm more comfortable bending those rules when it makes sense. A traditional approach works well if the incoming data are consistent and arrive in a standard, predictable format. When questions change from year to year, for example, then this model really becomes challenged.
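Just to make that concrete, here is a rough sketch of what such a traditional design might look like. All table and column names are illustrative only, not taken from any particular system.

  -- Hypothetical star schema for a fixed-format survey (names illustrative only)
  CREATE TABLE DimStudent (
      StudentKey   INT PRIMARY KEY,
      StudentId    VARCHAR(20),
      Major        VARCHAR(50)
  );

  CREATE TABLE FactNewStudentSurvey (
      StudentKey       INT REFERENCES DimStudent (StudentKey),
      SurveyDateKey    INT,
      Q1_Satisfaction  INT,           -- one column per question: breaks when questions change
      Q2_Recommend     INT,
      Q3_Comments      VARCHAR(500)
  );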

2. Multiple "Fact" Table Approach
This approach addresses the shortcomings of the single fact table design (see approach #1) by creating a new data structure for each incoming survey while maintaining the ability to join to some of our conformed dimensions. If you're using Business Objects, of course, you're now going to have to deal with a Universe with multiple contexts.
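A hedged sketch of the idea, again with purely illustrative names: each year's survey gets its own structure but shares the conformed student dimension.

  -- Hypothetical per-survey fact tables sharing a conformed student dimension
  CREATE TABLE FactNewStudentSurvey2012 (
      StudentKey      INT REFERENCES DimStudent (StudentKey),
      Q1_Satisfaction INT,
      Q2_Recommend    INT
  );

  CREATE TABLE FactNewStudentSurvey2013 (
      StudentKey      INT REFERENCES DimStudent (StudentKey),
      Q1_Satisfaction INT,
      Q2_Recommend    INT,
      Q3_AdvisingUse  INT            -- a new question this year gets its own column
  );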

3. Entity-Attribute-Value (EAV) Approach
An Attribute-Value Pair design can be appealing to the ETL developer since it is built to accommodate any number of changes in the questions and response values: instead of being stored in columns, they are pivoted and become values themselves in two simple data elements, Survey Question and Survey Response. This design can now absorb a variety of survey changes and even entirely different surveys.
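A minimal sketch of what that structure might look like (names are illustrative only):

  -- Hypothetical EAV-style survey table: one row per student, question, and response
  CREATE TABLE FactSurveyResponse (
      StudentKey     INT REFERENCES DimStudent (StudentKey),
      SurveyName     VARCHAR(50),    -- e.g. 'New Student Survey 2013'
      SurveyQuestion VARCHAR(200),   -- the question itself becomes data
      SurveyResponse VARCHAR(500)    -- every response stored as text
  );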

While this design is appealing for some use cases, it's also a bit of robbing Peter to pay Paul, since it pushes additional complexity onto the report developer. Data in this format can feel unnatural to many users, and they often simply spend their time trying to "re-pivot" the data into a traditional row-column format. Also, the column holding all of the question responses must inevitably be formatted as text (varchar, long, etc.), which inhibits the behavior of any numeric response values (aggregations, custom functions) and requires more heavy lifting at the BI layer. Sometimes this design is appropriate in the back-end staging area in order to load changing datasets, but we're still left with some of the above challenges of having to, perhaps, completely drop and rebuild table structures and make updates at the BI layer.
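For what it's worth, here is a rough sketch of the kind of re-pivot (and cast back to numeric) the report developer typically ends up writing. The question codes are hypothetical, and this is exactly the lifting that gets pushed toward the BI layer.

  -- Re-pivot the EAV rows back into columns for one survey (illustrative names)
  SELECT
      StudentKey,
      MAX(CASE WHEN SurveyQuestion = 'Q1_Satisfaction'
               THEN CAST(SurveyResponse AS INT) END) AS Q1_Satisfaction,
      MAX(CASE WHEN SurveyQuestion = 'Q2_Recommend'
               THEN CAST(SurveyResponse AS INT) END) AS Q2_Recommend
  FROM FactSurveyResponse
  WHERE SurveyName = 'New Student Survey 2013'
  GROUP BY StudentKey;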

If I could find a good way to use the EAV approach with BO Webi I would be more inclined to implement it. I'm curious to see whether some of the other tools we're looking at can more easily accommodate the need for a re-pivot.



Tuesday, March 12, 2013

Fact Explosion Part 2


In the last post, Fact Explosion Part 1, I presented the problem of delivering fact (measure) information that is at a different grain than the surrounding dimensions due to things such as multivalued dimensions or repeating attributes. Although from a modeling perspective these two scenarios are different by definition, they often present the same front-end reporting challenges.

Normally we aim for a nice one-to-many relationship between our dimension and fact tables. Let’s say we have a Course Enrollment Fact table at the grain of 1 row per student, course, and term. Of course we inevitably want to know who taught the course (instructor), but each Course Section can have more than 1 instructor, so if we were to join an Instructor Dimension directly to the fact table it would be in a many-to-many relationship. Or perhaps our Student Dimension must accommodate the fact that a student can, and often does, have a double major or more, resulting in a multivalued, repeating-attribute design of our dimension.

So what are our options?

1.    Don’t allow the scenario to occur
This may be the best option for the masses. Create a “view” or Universe or semantic layer that keeps everything in check so the average user doesn’t report out incorrect numbers. Of course, this is not always possible….

Pros: Prevents incorrect aggregations and can be delivered to a larger audience without fear of incorrect reporting

Cons: Doesn’t always address real world reporting needs


2.    Deliver as is and push the challenge to the report developer
This option should only be considered if working with very experienced data and report writers, otherwise incorrect data aggregations could result.


Table 1: Incorrect Aggregation

  Name                 Major      FinancialAidAmt
  Catherine            English    $10,000
  Catherine            History    $10,000
  Catherine Subtotal              $20,000
  Joe                  Poli Sci   $25,000
  Joe Subtotal                    $25,000
  Grand Total                     $45,000

Again, Catherine received $10,000 of Financial Aid in total and is a double major. The Aid amount is not at the major grain. Simply using a basic SUM aggregation may make it appear that she has $20,000 in aid. In order to prevent this we often need to rely on custom functions available to us in our BI tools - partition by student, break on student, section by student, creative use of AVG and SUM across rows, and in BO, functions such as FOR EACH, FOR ALL, and IN - all to prevent incorrect aggregation. (A rough SQL sketch of one such approach follows Table 2.)

Table 2: Correct Aggregation

  Name                 Major      FinancialAidAmt
  Catherine            English    $10,000
  Catherine            History    $10,000
  Catherine Subtotal              $10,000
  Joe                  Poli Sci   $25,000
  Joe Subtotal                    $25,000
  Grand Total                     $35,000

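One hedged way to produce the Table 2 totals in plain SQL is to collapse the aid to one row per student before summing, so the double major doesn't double the dollars. Table and column names here are illustrative only.

  -- Collapse to one aid amount per student, then sum (illustrative names)
  SELECT SUM(StudentAid) AS TotalFinancialAidAmt
  FROM (
      SELECT Name, MAX(FinancialAidAmt) AS StudentAid
      FROM StudentMajorAid            -- the exploded student/major/aid rows
      GROUP BY Name
  ) PerStudent;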

Pros: Users have more data elements to explore against various metrics

Cons: Greater likelihood of incorrect reporting

3.    Rows to Columns – Flatten the repeating attributes
This is a classic approach that works well in the right circumstances, especially considering its limitations. It works well when there is a relatively small and known number of possible values to pivot. For example, if our institution limits area of study to a maximum of two majors, we have a small and known number of rows to pivot into column format. It doesn't work well when the number of values in a particular domain is large and highly variable. This method also presents other challenges, such as filtering all students of a particular major, since the majors are now spread across multiple columns. Not impossible to solve, but more complicated. (A rough sketch of the flattening follows Table 3.)

Table 3:

  Name                 Major1     Major2     FinancialAidAmt
  Catherine            English    History    $10,000
  Catherine Subtotal                         $10,000
  Joe                  Poli Sci   NA         $25,000
  Joe Subtotal                               $25,000
  Grand Total                                $35,000

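A rough sketch of the flattening, assuming a StudentMajor table with one row per student and major (names illustrative):

  -- Pivot up to two major rows per student into Major1/Major2 columns
  SELECT
      Name,
      MAX(CASE WHEN MajorSeq = 1 THEN Major END) AS Major1,
      MAX(CASE WHEN MajorSeq = 2 THEN Major END) AS Major2
  FROM (
      SELECT Name, Major,
             ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Major) AS MajorSeq
      FROM StudentMajor
  ) m
  GROUP BY Name;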

Pros: Flattens the student dimensional attributes into a single row and decreases likelihood of incorrect aggregations at the reporting layers

Cons: Introduces difficulty in exploring data by major


4.    Concatenation of multivalued attributes
This is similar to our previous solution, but it combines both major values into a single column rather than two. (A rough sketch follows Table 4.)

Table 4:

  Name                 AllMajors           FinancialAidAmt
  Catherine            English; History    $10,000
  Catherine Subtotal                       $10,000
  Joe                  Poli Sci            $25,000
  Joe Subtotal                             $25,000
  Grand Total                              $35,000

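A rough sketch of the concatenation, assuming a string aggregation function is available in your database (STRING_AGG here; Oracle's LISTAGG is similar). Names are illustrative.

  -- Concatenate all of a student's majors into one delimited column
  SELECT
      Name,
      STRING_AGG(Major, '; ') AS AllMajors
  FROM StudentMajor
  GROUP BY Name;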

Pros: Similar to our previous example in that it flattens the student attributes into a single row minimizing fact explosion WHILE ALSO retaining the major information into a single column perhaps conducive to wild card filtering

Cons: Limits analysis by major due to concatenation


5.    Allocation
We can always allocate the total aid amount by student across the appropriate number of rows. This is a modeling technique that uses a weighting factor to determine the allocation percentage. For example, where Catherine has 2 rows (majors) and a total of $10,000 in aid, we assign each of the two rows a value of $5,000. (A rough sketch follows Table 5.)

Table 5:

  Name                 Major      FinancialAidAmt
  Catherine            English    $5,000
  Catherine            History    $5,000
  Catherine Subtotal              $10,000
  Joe                  Poli Sci   $25,000
  Joe Subtotal                    $25,000
  Grand Total                     $35,000

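A rough sketch of an even allocation using a window function; a real implementation might carry a weighting-factor column instead, and all names are illustrative.

  -- Spread each student's aid evenly across their major rows
  SELECT
      Name,
      Major,
      FinancialAidAmt * 1.0 / COUNT(*) OVER (PARTITION BY Name) AS AllocatedAidAmt
      -- the * 1.0 avoids integer division if the column is an integer type
  FROM StudentMajorAid;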
Pros: Addresses the fact explosion issue by allocating the totals across rows so we can now continue to use basic SUM functions across rows.

Cons: This can be confusing to some users and also becomes more expensive at the ETL layer, and possibly at run time, depending on when and where the allocation calculations take place


So those are a few options for addressing these types of data challenges. Many require work from the Data Architect and ETL developers via SQL, creating appropriate data structures and processing, adjustments to the data models, bridge tables, etc.

What’s the best option? It depends on many factors – BI tool capabilities, end user skills, reporting use cases, and requirements. But ultimately, I use them all and often in concert with each other because there are always multiple business requirements that need to be solved for on any given project.



Monday, February 25, 2013

Fact Explosion Part 1


Some of the most frequent modeling challenges in a Higher Ed Data Warehouse have to do with multivalued dimensions and multivalued attributes. The truth is these are not unique to Higher Ed – they occur in every industry and in virtually every data mart I’ve ever worked on. The use cases and solutions are also, technically, industry agnostic. What’s interesting in HE is just how pervasive this is in the Ellucian (Datatel) Colleague student system. But again, the modeling techniques are all well documented and fairly commonplace for a data architect. What I find a bit more challenging is how we represent this at the BI layer and educate our users on the various approaches, their benefits, and their risks.

Let’s start with a simple example report where we want to see ‘Total Financial Aid Amt’ by Student.

Table 1:

  Name         FinancialAidAmt
  Catherine    $10,000
  Total        $10,000

Ok, as simple as it gets, right? 

What happens if we now want to include a student (repeating, multivalued) attribute, perhaps Major? (If we visualize a potential star schema, it may have a Student Dimension and a Financial Aid Fact table. If the Student Dim is at the grain of 1 record per student, we have a nice one-to-many relationship between the Dimension and the Fact table. If the Student Dimension is not at that grain, or perhaps joins to an outrigger table, and contains repeating rows for each student to accommodate multiple majors, then we have a many-to-many from Dimension to Fact.)
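A hedged sketch of the join that produces that explosion, with illustrative names only: because the dimension repeats per major, each fact row is duplicated.

  -- Join an aid fact to a student dimension that carries one row per student per major
  SELECT d.Name, d.Major, f.FinancialAidAmt
  FROM FactFinancialAid f
  JOIN DimStudent d ON d.StudentKey = f.StudentKey;   -- fans out to one row per major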

For example:

Table 2:

  Name         Major      FinancialAidAmt
  Catherine    English    $10,000
  Catherine    History    $10,000
  Total                   $20,000


By simply introducing a multivalued attribute of student we experience a form of ‘fact explosion’ – incorrectly doubling our Aid Amt measure – and a major reporting challenge. Our ‘Financial Aid Amt’ is not at the grain of student major (i.e., we can’t break down Aid by major; it exists only at the Student, or Person, level).

Aside from the way I have aggregated the Total in Table 2, there is nothing inherently wrong with these data. There are few absolutes in modeling. If asked how one should handle such a scenario, the best answer is always the same: “It depends.” Too often we look for a cookie-cutter solution, the “Best Practice”, but what’s more important is understanding what the goals and requirements are and having the tools, techniques, and knowledge base to apply the correct solution to fit the problem.

At a recent TDWI Conference I attended an advanced modeling session given by Chris Adamson (http://www.oaktonsoftware.com/) and he said something I thought was very well stated. He surveyed the audience and asked how many would consider themselves ETL developers, Business Analysts, DBAs, BI Developers, etc. He intentionally omitted a call for ‘Data Architects’ (DA) stating that one of the primary roles of the DA was to balance the shifting burdens of various technical challenges to the right layer (paraphrasing).

In the above example, there are techniques available to us at both the BI and data layers that we can use to fit the use case/s in our environments. Perhaps we’ll need to utilize several as there is rarely if ever only a single use case to solve for.

I’ll discuss the options in a future blog post.

Thursday, February 21, 2013

Collaborative Requirements

I used to work with a guy who was the Roger Ebert of the Business Requirements Document (BRD) critique. In a sick way I think he really looked forward to receiving them, and he would then proceed to slowly dissect and disintegrate any life form that contributed to their existence. He was a connoisseur and the rest of us were cheap can drinkers. Every BRD he received fell far short of his expectations.

The truth is, the entire system and process was doomed from the start. He relied on those unfamiliar with the DW space to go talk to users about their "needs" while he waited patiently, scheduling cron jobs and scripting his latest stored procedure. A somewhat common scenario in the world of corporate silos that perhaps relies too heavily on the BSA and the separation between the tech team and the business. This is a flawed approach and, thankfully, it is slowly going by the wayside.

If you're part of building the DW, take part, contribute. Roll your sleeves up and get dirty. Ensure you're invited to those critical meetings. BI doesn't succeed with tech geeks who can't understand the problem they're trying to solve. We only exist to support the business; we need to remember that and ensure we take part in that process. Don't forget about the first word in Business Intelligence.

If your role is to develop or architect a solution, then don't rely on the person with the BSA title to do this work alone. Collaborate on the effort as a team. This seems like less of a problem in HE than in the corporate world, at least where I am. Maybe I'm lucky.

We all have skin in the game.


Tuesday, February 19, 2013

BI Portfolio & Standardization

In BI the term 'standardize' is often misused and too narrowly defined to mean ONE single tool or platform or report template. This is a false reality, and one our user base will either outright reject or comply with in frustration. As technologists, we too often want to become part of the story rather than just providing the tools to tell it. Standardization should not be narrowly defined as ONE single thing but as a set of things (tools, conventions, templates, processes, etc.) that can address each business use case.

Of course, there is a fine line we straddle here between being streamlined and being fragmented. Most of us can speak to this in our everyday lives as users of iPhone and Android devices. One's greatest selling point is providing a single, consistent user experience. The other promotes customization and flexibility. A great debate, albeit out of scope for this post, but something we also face on our DW/BI projects. A standard, consistent user experience can happily coexist with the flexibility to customize and adjust to changing business needs, though it certainly requires more thought, discussion, and caution.

This is facing us in my current project. We have purchased the SAP Business Objects Edge platform and are building our user base and training offerings on the Webi Rich Client. There is a thought, though, that "Webi" is now our standard BI tool that everyone will use in all (ok, most) situations. This is not reality. Webi does not fit every analytical use case, nor should one try to make it do so.

We will aim to standardize at the Portfolio level, not at the single-Tool level, around a robust set of options in:

  • Reporting
  • Ad hoc analysis
  • Data visualization, dashboards, mashups
  • MSFT Office integration

We'll be looking at SAP's offerings in each category against some additional players like QlikView, Tableau, and Microsoft PowerPivot and PowerView. It will raise the great Best-of-Breed vs Integrated Solution debate.

I'll update as we go.


Monday, February 18, 2013

Data Warehouse Architecture

One of the first decisions any institution will have to make is what Data Warehouse architecture to commit to. I use the word 'commit' with intention. While this is hardly an all-or-nothing endeavor (few if any organizations can claim to have any one single architecture), if you want to arrive somewhere it's important to have that goal in mind, because there will be an enormous number of decisions to be made along the way.

Below is an overview of the major approaches to an 'Information Warehouse'. Two cautions: one, these are just overviews, and two, I am taking a liberal view of the term "Data" or "Information" Warehouse (yes, I know many would not consider an ODS a DW, but let's just look at them all for the sake of discussion).

1. Kimball
The classic Dimensional Model that most seem to associate with a DW. Here, data are acquired from their source systems (ETL) and delivered into a target dimensional design, or Data Warehouse. Within the larger DW exist data marts of subject-area-focused data sets, which are brought together via the use of conformed dimensions. Our BI tools then sit on top of the DW.

2. Inmon
Bill Inmon is credited with coining the term Corporate Information Factory and has also been, incorrectly, accused of being opposed to the Dimensional design Kimball is so famous for. He is not. Don't feel bad; for years I too was operating under this false premise, assuming he was opposed to the Dimensional Model and advocated reporting from a model in 3NF (3rd normal form) - aka one similar to our source OLTP. We stopped listening at that point, I guess.

The Corporate Information Factory does in fact contain data marts in the form of a Dimensional Model, fed from the Enterprise Data Warehouse (EDW); it's just where (or at what stage in the entire data flow) these data are accessed that differentiates Inmon from Kimball. Again, Inmon advocates that the EDW be in 3NF and then feed subject-area-focused data marts (in dimensional format), while Kimball allows the 'EDW' itself to be in dimensional format and accessed directly - simply one less degree of separation from the end user. These differences are likely only ever experienced by our Data Architects and ETL developers.

So why all the controversy? Who knows....

3. Departmental
Almost a natural extension of life in today's organization. Requests to IT answered with not enough time, resources, or funding to address business needs lead to a proliferation of self-made, self-grown shadow systems that ultimately transform into 'production'-ready lifelines that run our companies.

Because of this, I am seeing a trend, particularly among consultants, away from the Enterprise Data Warehouse (EDW), especially in large organizations, toward more departmentally focused information silos that are maintained and funded locally. Why? High cost, diverse requirements, lower Total Cost of Ownership (TCO), and quicker time to market, to name a few. This will be an interesting trend to watch. Perhaps more evidence toward distributed IT?

4. Operational Data Store (ODS)
The ODS is common in today's world as it's often quickly built and fairly easily administered. A lot of these are simply copies of our ERP's database over to a new reporting instance, perhaps from the nightly backup, with minor enhancements and a BI layer built on top. Voila.

This very well could be an attractive path for many organizations. Why?

Vendor ODS
a) It's often being sold to us. Literally. All of our ERP systems and most 3rd party software packages are shipping, for a small additional cost mind you, with their own pre-built data marts and associated BI layer to have you up and running quickly and with a small IT footprint. (This is why all of the big ERP vendors swallowed up the major BI players.) Ellucian sells this as part of their Colleague system, using Business Objects as the reporting layer.

It's obvious why this is an attractive option: no big DW project spanning months and years, with lots of resources costing lots of money. In the real world, I've never seen this dream become reality. In short order, our business users are asking more sophisticated questions that cannot be handled by the vendor's ODS; they want to make tweaks to canned reports that either can't be done or go back into the IT queue (contrary to why we bought these things in the first place); and if those changes can be made, we are at risk of losing support from our vendor because we've modified their product. Most, if not all, of these projects that I've seen result in rework of some kind or even a complete redirection.

It sounds like I am totally opposed to this approach. I'm not. I do think there is a valid use case for a vendor-shipped ODS, but I would recommend going into it with eyes wide open, being aware of its ultimate limitations, and perhaps even committing to it sitting alongside a more traditional DW.

In-House ODS
b) A second type of ODS is largely the same as the one I've just described, except that it is developed and maintained by our own in-house resources. This has the benefit of providing a relatively quickly built reporting environment while also being, hopefully, more flexible and accepting of change because it is internally administered.

If you go the route of the ODS, be keenly aware of what you're getting and what you're not. It may largely deliver basic reporting needs to the business, but it will struggle with, if it can handle at all, more advanced analytical queries such as historical, longitudinal questions, where a DW and an OLTP take vastly different approaches to historical data.

5. Hybrid
This is a particularly interesting, and at times attractive, compromise between the traditional Dimensional Model and the ODS approach. With core, conformed dimensions centrally built and maintained (often the ONLY place a truly centralized, multi-sourced Student or Faculty may exist, for example), they can be joined with more traditional operational (ERP, OLTP) table structures.

I use a Hybrid approach quite frequently for prototyping or POC (proof of concept) during projects before converting to a more traditional DW design. That said, I have also delivered perfectly working models using this approach that are still in production today with happy users. I think this approach becomes more viable and palatable if we have already built our core conformed Dimensions, which are then joined to operational data structures for smaller reporting needs.
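As a rough, purely illustrative sketch of the hybrid idea: a conformed dimension joined straight to an operational table. None of these names come from any vendor schema.

  -- Conformed student dimension joined to an operational (ERP/OLTP) table
  SELECT d.StudentId, d.Major, r.CourseId, r.RegistrationDate
  FROM DimStudent d
  JOIN erp.CourseRegistration r ON r.StudentId = d.StudentId;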