From: owner-dwlist@datawarehousing.com [mailto:owner-dwlist@datawarehousing.com] On Behalf Of Ralph Kimball
Sent: Monday, June 18, 2001 6:48 PM
To: dwlist@datawarehousing.com
Subject: FW: dwlist: Kimball or Inmon or mixed ?
http://www.DataWarehousing.com is sponsored by DataMirror Corporation, a leading provider of real-time data integration and resiliency solutions. Please visit our sponsor today at http://www.datamirror.com to download fully functional software and data warehousing white papers.
From: "Ralph Kimball" <ralph@ralphkimball.com>
Hi Madani,
I have never been comfortable with religious positions, or "camps", or rules that say a data warehouse is this or a data mart is that. The range of responsibilities and design challenges that we in the data warehouse community have is too varied to make high level pronouncements apply in every case. What is needed is flexible judgment, not rigid rules.
Fundamentally, I think a data warehouse manager has two over-arching responsibilities:
1. to collect and archive an organization's data, and especially
2. to publish the right data to end users most effectively.
In my opinion, all the rest is technique. If you do a great job with these two responsibilities, then I am an eager student of your approach, whether you use E/R modeling, dimensional modeling, ROLAP or OLAP.
I have written extensively about a set of techniques called dimensional modeling because I have seen these techniques work in many different situations. In my writing about dimensional modeling, I have tried to take an engineer's practical approach to solve specific problems that every data warehouse designer must eventually consider, including
* multiple organizations, multiple data sources, and multiple incompatible technologies. In other words, a distributed data warehouse environment that can never be fully centralized.
* continuously evolving data sources and business requirements, that require graceful extensibility
* finite budgets that preclude getting it all designed at the beginning, and that require an incremental approach
* "unreasonable" demands to deliver results (data) from the data warehouse in weeks, not years.
I have addressed these situations with very specific recommendations down to the bit level of design including
** slowly changing dimensions, types 1, 2, and 3, including detailed administrative scenarios for classifying changes in data and handling each of these cases
** specific recommendations for handling the assignment of keys, including the design of a surrogate key pipeline for the back room data staging area
** specific techniques for handling post-dated data that arrives late
** specific techniques that place an upper bound on the data explosion and administrative complexity of handling real data problems including: "slowly changing" dimensions that turn out to be "rapidly changing"; multiple roles played by a single dimension; dimensional attributes that legitimately take on multiple values at the time of a single measurement; dimensional attributes that possess a ragged hierarchical structure
** specific techniques for handling what I believe are the three fundamental grains of fact tables: transaction, periodic snapshot, and accumulating snapshot
** specific techniques for representing what didn't happen, and for asking what doesn't exist
** specific architecture for combining results across incompatible technologies in different physical locations (drilling across using conformed dimensions). I have described the details of this processing down to the specific SQL needed to assemble the high level answer sets.
** specific ways to handle unexpected new measurement types, unexpected new descriptive attributes, unexpected new dimensions, unexpected new natural keys, and unexpected new data sources.
** specific techniques for getting a result based on a single data source to the end users expeditiously, while having a coherent framework and plan for delivering integrated results at a later time that depend on multiple data sources (so-called first and second level data marts)
** specific techniques for building and taking advantage of aggregates, which are the single most powerful technique for improving performance in a data warehouse environment
** specific techniques for drilling down from aggregated data to atomic data with a single end user tool in a single physical data environment.
Madani, when you quote a consultant who recommends using an E/R approach, I am concerned that you are leaving the readers with a dilemma. The E/R approach is elegant and appealing, but where do you to learn detailed technique? Is there a body of knowledge, anywhere, that gives the data warehouse designer guidance?
I would love it if someone would step up to all the points I've listed above, but expressed in an E/R framework. Especially,
>> what are the detailed implications of administering a "time variant E/R model"? Does this mean date keys embedded in every entity? Begin dates as well as end dates? How many date keys in how many tables do you need to snapshot a complete profile of a customer or a product at a point in time? How do you administer the keys to maintain this snapshot? What is the E/R version of a surrogate key pipeline? Do you even have surrogate keys in an E/R model?
>> how do you deal with the fact that E/R schemas CHANGE every time the designer decides to physically instantiate a new many-to-1 relationship? This means RECODING all the applications. "Graceful extensibility" means that old applications keep working even after unexpected changes have been made. In the dimensional modeling world, we have specific criteria for describing and managing these unexpected changes and for understanding the boundaries of graceful extensibility. Where are the comparable guidelines in the E/R approach?
>> Which tools can you use to drill down from aggregated data to atomic data if you insist on defining a data mart as only containing aggregated data, and that the remote data warehouse contains atomic data in an E/R format? What are the SQL transformations needed to achieve this?
Madani, you have a typical situation where your input data is in a flat file form. You clearly need mechanisms for ensuring data quality. You can check a many-to-1 relationship in a flat file with a single sort. I like sorting as a primary data cleaning technique because it also flushes out other kinds of issues like typos and variant spellings with data where there may be no many-to-1 constraints that would show up in an E/R model. E/R models only enforce one data relationship: many-to-1 mappings.
Returning to my points at the beginning: a data warehouse designer should choose whichever techniques he or she feels comfortable with, keeping in mind the highest goals of data archiving and data publishing. I think it would be terrific if other practitioners who have built successful data warehouses would share their techniques, as long as they are specific and detailed in their recommendations.
Ralph Kimball