Designing a Data Warehouse with Granular Data in Mind
How many times have you heard the phrase, “We only need to report at the Category (or some level above the detail level) Level,” when you are gathering requirements for designing your Data Warehouse or DataMart? And then, inevitably, 6 or 8 months later someone needs a report, or needs to be able to analyze at the Product level.
There are a few ways to handle this situation. The first is to build the Data Warehouse at the level requested. I recommend that you document this as being at the specification of the Business users involved, that way if the database design needs to be revisited in the future, the developer will know when the decision was made to be at the higher grain level.
The second option is to determine the lowest grain of the type of analysis that may need to be done for the first set of analysis and reporting. This requires that the person who is gathering the requirements be able to think about the way the existing reports are written and think about what would be a useful addition? Often the business people are very busy, and spending time thinking about the possibilities of different reports requires more time than they have at the moment.
The third option is to try to anticipate if the involvement of other departments or types of analysis might be required in the future, and to add the required detail into the data warehouse. This approach requires that the developer has an extraordinary understanding of business processes and analytical needs, and can anticipate how different departments will want to do their analysis. This approach does not usually happen, and it is extremely challenging and time consuming.
The fourth option, and the recommended option, is to build a data warehouse that has the flexibility to readily add to the design, instead of trying to build every option up front. We do this by capturing the data at the transaction level, and have the design allow adding information as opposed to changing the basic design. The potential challenge here is that the amount of data may require additional hardware resources, and a longer load time to capture the extra data. This cost could be offset by the savings of not having to rebuild the application (and the hardware may be required anyway).
Let’s look at a hypothetical business scenario. The Finance department of the Acme Shirt Store wants to build a budget application. The budget is defined at the category level, and to simplify, we will limit it to one category: Shirts. The budget might target increases of 5% for all Shirts.
When you budget for something, you need to compare your sales against the budget to see how you’re doing. Since the requirements are to budget against the category, tracking against the category also seems to make sense. So to be efficient, we roll the sales up to the category level every day. Less data to store, fast response for reporting, everything’s great. Then in a few weeks, the marketing department would like to know how Dress shirts are doing vs. Sports shirts. The application doesn’t track at subcategories, so now modifications to the application need to made to determine subcategory estimates – algorithms created to determine ratios to multiply the values by, and come up with close estimates to the real numbers. Pretty soon they want to know about long sleeves vs short sleeves, prints vs solids, what colors are doing the best, and don’t forget about tracking the different sizes.
The original, efficient, fast design is now so complex, that every change has to make sure that it does not undo one of the previous modifications. This becomes very challenging.
With some forethought in the beginning, we could capture the sales at the item level (SKU or scan). We design the Product dimension to have a column for Category. It is easy to group by Category, and compare to the Budget numbers. When the marketing department asks for tracking numbers of Dress shirts vs Sports shirts, all that needs to be done is to add another column with this subcategory. Long sleeve or short sleeve, another column. Print or pattern – you get the idea.
Right now you’re saying to yourself, “Nobody has a product dimension that only has category as its lowest grain level” – and you‘d be right. I used this example to illustrate the issue. But it is common for some product dimensions to have the grain be at the “product” level, in this case something like “button down oxford dress shirt.” A lot of useful information can be reported at this level, for budgeting and year over year trending. The challenge is this – what sizes sell the most? What colors are the most popular? Which colors and sizes have to go on clearance? These answers can only be gotten if the products are tracked at the item (scan) level. Then specific attributes can be added to the product dimension anytime the business users want – of course the business users would have to provide the attributes and the descriptions that they want to track, but it can be done.
If we think back to the original requirements, to be able to budget and track sales progress at the Category level, we now have enough insight to understand there are two options: First is to go along with the requirements at face value, knowing that in the future retrofitting lower grains of detail is going to be messy, expensive and time consuming. Or, second, to capture the sales at the lowest grain, being sure to have the Category attribute column established so grouping at that level is simple. This path will require more hardware, and maybe more processing time, but the foundation will be there for enhancements and modifications in the future.
There are other challenges to this approach, of course. Next time we will look into some of those.