Wednesday, March 13, 2013

Survey Data and the Warehouse

In a perfect world, survey data methodology would be consistent over time - the same questions, administration methodology, sample sizes, and target population - but most of us don't live in that world. A core survey may be administered yearly, perhaps a New Student Survey, but over time questions inevitably get modified, new ones added, some removed, and the way it's administered changes. From one perspective, evolving improvements are a good thing: if we can make something better, why wouldn't we? But aside from the challenges this presents for longitudinal statistical analysis, it also presents challenges to the Data Warehouse, as there is increasing demand to bring these survey data into the Warehouse and marry them with our more robust Dimensions, providing the opportunity for a much richer, broader, and deeper set of questions to be answered.

You'll likely not receive a warm response from any Data Architect or ETL developer when they hear that the table of survey data can change each year (number of columns, column names, response values, data types). This forces them to change table DDL, rebuild data structures, update the model, and modify ETL packages. So, what to do? I've addressed this three different ways, and I'm sure there are others.

(Note: the diagrams below are conceptual in nature.)

1. Traditional Dimensional Model Approach
Earlier in my career I felt more compelled to pray at the altar of "Best Practices" and would incorporate survey data into a traditional Star Schema design. Now I'm more comfortable bending those rules when it makes sense. A traditional approach works well if the incoming data are consistent and arrive in a standard, predictable format. When questions change from year to year, however, this model really becomes challenged.
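To make the shortcoming concrete, here's a minimal sketch of the traditional design using an in-memory SQLite database. All table and column names are hypothetical: one fact table joined to a conformed dimension, with one column per survey question - exactly the layout that breaks when next year's survey adds or drops a question.

```python
import sqlite3

# Hypothetical star schema sketch: one survey fact table, one conformed dimension.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_student (
    student_key INTEGER PRIMARY KEY,
    student_id  TEXT,
    class_year  TEXT
);

CREATE TABLE fact_new_student_survey (
    survey_key      INTEGER PRIMARY KEY,
    student_key     INTEGER REFERENCES dim_student(student_key),
    survey_year     INTEGER,
    q1_satisfaction INTEGER,   -- one column per question: adding, removing,
    q2_housing      TEXT,      -- or retyping a question means new DDL, a
    q3_advising     INTEGER    -- model change, and ETL rework
);
""")

cur.execute("INSERT INTO dim_student VALUES (1, 'S001', 'Freshman')")
cur.execute("INSERT INTO fact_new_student_survey VALUES (1, 1, 2013, 4, 'On-campus', 5)")

# The payoff of this design: clean joins to conformed dimensions.
row = cur.execute("""
    SELECT d.class_year, f.q1_satisfaction
    FROM fact_new_student_survey f
    JOIN dim_student d ON d.student_key = f.student_key
""").fetchone()
print(row)  # ('Freshman', 4)
```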


2. Multiple "Fact" Table Approach
This approach essentially addresses the single-fact-table design's shortcomings (see approach #1) by creating a new data structure for each incoming survey while maintaining the ability to join to some of our conformed dimensions. If you're using Business Objects, of course, you're now going to have to deal with a Universe with multiple contexts.
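A sketch of this approach, again with hypothetical names: each survey vintage gets its own table with its own columns, and all of them join to the same conformed dimension. Nothing already loaded has to change when next year's questions differ.

```python
import sqlite3

# Hypothetical sketch of approach #2: one "fact" table per survey vintage.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, student_id TEXT);

-- The 2012 survey had three questions...
CREATE TABLE fact_survey_2012 (
    student_key INTEGER, q1 INTEGER, q2 INTEGER, q3 TEXT);

-- ...2013 dropped q3 and added two new ones; 2012's structure is untouched.
CREATE TABLE fact_survey_2013 (
    student_key INTEGER, q1 INTEGER, q2 INTEGER, q4 TEXT, q5 INTEGER);
""")

cur.execute("INSERT INTO dim_student VALUES (1, 'S001')")
cur.execute("INSERT INTO fact_survey_2013 VALUES (1, 5, 4, 'Transfer', 3)")

# Each yearly table still joins to the same conformed dimension.
row = cur.execute("""
    SELECT d.student_id, f.q4
    FROM fact_survey_2013 f
    JOIN dim_student d USING (student_key)
""").fetchone()
print(row)  # ('S001', 'Transfer')
```

The cost, as noted above, shows up at the BI layer: each new table means another context (or equivalent) to manage in the semantic layer.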


3. Entity-Attribute-Value (EAV) Approach
Attribute-Value Pair design can be appealing to the ETL developer since it is built to accommodate any number of changes in questions and response values: instead of storing them in columns, they are pivoted and become values themselves in two simple data elements, Survey Question and Survey Response. This design can now absorb a variety of survey changes and even different surveys altogether.
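A minimal EAV sketch (hypothetical names again): questions and responses land as row values, so new questions - or entirely different surveys - load into the same structure with no DDL changes at all.

```python
import sqlite3

# Hypothetical EAV structure: one narrow table absorbs any survey.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
CREATE TABLE fact_survey_response (
    student_key     INTEGER,
    survey_name     TEXT,
    survey_year     INTEGER,
    survey_question TEXT,
    survey_response TEXT    -- every response, numeric or not, lands here as text
)""")

rows = [
    (1, 'New Student Survey', 2013, 'Q1: Overall satisfaction', '4'),
    (1, 'New Student Survey', 2013, 'Q2: Housing type', 'On-campus'),
    (2, 'Senior Exit Survey', 2013, 'Q1: Would recommend?', 'Yes'),  # a different survey, same table
]
cur.executemany("INSERT INTO fact_survey_response VALUES (?,?,?,?,?)", rows)

count = cur.execute(
    "SELECT COUNT(DISTINCT survey_name) FROM fact_survey_response").fetchone()[0]
print(count)  # 2
```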

While this design is appealing for some use cases, it's also a bit of robbing Peter to pay Paul, pushing additional complexity onto the report developer. Data in this format can feel unnatural to many users, who often end up spending their time trying to "re-pivot" the data back into a traditional row-column format. Also, the column holding all the question responses must inevitably be formatted as text (varchar, long, etc.), which inhibits the behavior of any numeric response values (aggregations, custom functions) and requires more lifting at the BI layer. Sometimes this design is appropriate in the backend staging area in order to load changing datasets, but we're still left with some of the above challenges of having to, perhaps, completely drop and rebuild table structures and make updates at the BI layer.
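To illustrate the work that gets pushed downstream, here's a sketch (hypothetical data) of re-pivoting EAV rows back to one row per respondent, plus the text-to-numeric cast that has to happen before any aggregation behaves numerically. In SQL this is typically done with CASE-based conditional aggregation; plain Python makes the steps explicit.

```python
# EAV rows as (student_key, question, response) -- responses are all text.
eav_rows = [
    (1, 'Q1: Overall satisfaction', '4'),
    (1, 'Q2: Housing type', 'On-campus'),
    (2, 'Q1: Overall satisfaction', '5'),
    (2, 'Q2: Housing type', 'Off-campus'),
]

# Re-pivot: one dict (row) per respondent, one key (column) per question.
wide = {}
for student_key, question, response in eav_rows:
    wide.setdefault(student_key, {})[question] = response

# The text column inhibits numeric behavior, so numeric questions must be
# cast back before aggregating.
scores = [int(r['Q1: Overall satisfaction']) for r in wide.values()]
avg = sum(scores) / len(scores)
print(avg)  # 4.5
```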

If I could find a good way to use the EAV approach with BO Webi, I would be more inclined to implement it. I'm curious to see whether some of the other tools we're looking at can more easily accommodate the need for a re-pivot.



1 comment:

  1. Excellent summary! I appreciated walking through these possibilities when designing a universe for my survey data. Ultimately, for me, it came down to the question "are we going to use the data from this survey in isolation or will we want to be able to merge it with data from other universes?" Since we were going to be merging it with other data, we went with Door #1.
