Data Modeling

The world of data modeling (at least as it pertained to Bill Schmarzo) started with E. F. Codd and third normal form (3NF).  Codd defined third normal form in 1971 as a database normalization technique to improve database processing while minimizing storage costs. 3NF data modeling was ideal for online transaction processing (OLTP) applications with heavy order-entry workloads.

When I was coding in the early 1980s, disk space and processing power were extremely expensive. 3NF was designed to minimize the amount of data that we needed to store by ensuring that we eliminated data redundancy in storage. Heck, the entire “Y2K” panic was caused by programmers like me hardcoding “19” into the date (year) field so that we could save two bytes each time the “year” field was used in calculations or reporting. When we were writing programs in the 1980s, no one dreamed that our programs would still be running 20+ years later (I wonder if anyone ever found all the Easter eggs that I buried in my code, he-he-he).

As a result, we ended up with OLTP data models that looked like Figure 1.

Figure 1: Third Normal Form Sample Data Model
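To make the idea concrete, here is a minimal Python sketch (the table and column names are illustrative, not taken from Figure 1) of what 3NF decomposition does: a redundant order feed gets split into separate tables so that customer and product attributes are stored exactly once.

```python
import pandas as pd

# A denormalized order feed: customer and product attributes repeat on every row.
orders_flat = pd.DataFrame([
    {"order_id": 1, "cust_id": "C1", "cust_name": "Acme", "cust_city": "Denver",
     "prod_id": "P1", "prod_desc": "Widget", "unit_price": 9.99, "qty": 3},
    {"order_id": 2, "cust_id": "C1", "cust_name": "Acme", "cust_city": "Denver",
     "prod_id": "P2", "prod_desc": "Gadget", "unit_price": 4.50, "qty": 1},
])

# 3NF-style decomposition: each non-key attribute lives only in the table whose key it depends on.
customers = orders_flat[["cust_id", "cust_name", "cust_city"]].drop_duplicates()
products  = orders_flat[["prod_id", "prod_desc", "unit_price"]].drop_duplicates()
orders    = orders_flat[["order_id", "cust_id", "prod_id", "qty"]]

print(customers)   # one row per customer: no repeated name/city
print(products)    # one row per product: no repeated description/price
print(orders)      # the transaction itself carries only keys and quantities
```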

Data Modeling For Business Users

While 3NF was ideal for machine processing, the spaghetti nature of the data model made it nearly impossible for a human user to interpret.  The world of analysis (a.k.a. query, reporting, dashboards) required a different type of data model, one that supported analyses such as trend lines, period-to-date calculations (month-to-date, quarter-to-date, year-to-date), cumulative calculations, basic statistics (averages, standard deviations, moving averages) and previous-period comparisons (year ago, month ago, week ago).
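A minimal pandas sketch of the kinds of calculations listed above (the dates, values and column names are made up purely for illustration): a month-to-date running total, a 3-day moving average, and a previous-day comparison.

```python
import pandas as pd

# Illustrative daily sales for one store.
daily = pd.DataFrame({
    "date":  pd.date_range("2014-01-28", periods=7, freq="D"),
    "sales": [120, 95, 140, 110, 130, 150, 125],
})

# Month-to-date: cumulative sum that restarts at the first day of each month.
daily["month"] = daily["date"].dt.to_period("M")
daily["sales_mtd"] = daily.groupby("month")["sales"].cumsum()

# 3-day moving average and a "day ago" comparison.
daily["sales_3d_avg"]   = daily["sales"].rolling(window=3).mean()
daily["sales_prev_day"] = daily["sales"].shift(1)

print(daily[["date", "sales", "sales_mtd", "sales_3d_avg", "sales_prev_day"]])
```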

Ralph Kimball, while at Metaphor Computer Systems, pioneered dimensional modeling, or star schemas, in the early 1980s (see Figure 2).  The dimensional model was designed to accommodate the analysis approach of the business users via two important design concepts:

  • Fact tables (populated with metrics, measures or scores) that correspond to transactional systems such as orders, shipments, sales, returns, premiums, claims, accounts receivable, and accounts payable.  Facts are typically numeric values that can be aggregated (e.g., averaged, counted, summed).
  • Dimension tables (populated with attributes about that dimension) that represent the “nouns” of that particular transactional system such as products, markets, stores, employees, customers, and different variations of time.  Dimensions are groups of hierarchies and descriptors that describe the facts.  It is these dimensional attributes that enable analytic exploration, attributes such as size, weight, location (street, city, state, zip), age, gender, tenure, etc.

Figure 2: Dimensional Model (Star Schema)
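As a rough Python sketch of the idea (the table and column names are illustrative, not taken from Figure 2, and a real warehouse would hold these as relational tables rather than DataFrames): a star schema is one fact table of numeric measures keyed to small dimension tables of descriptive attributes.

```python
import pandas as pd

# Fact table: numeric, aggregatable measures keyed by dimension keys.
fact_sales = pd.DataFrame([
    {"date_key": 20140105, "store_key": 1, "prod_key": 10, "sales_amt": 250.0, "units": 5},
    {"date_key": 20140105, "store_key": 2, "prod_key": 11, "sales_amt": 90.0,  "units": 2},
    {"date_key": 20140112, "store_key": 1, "prod_key": 11, "sales_amt": 45.0,  "units": 1},
])

# Dimension tables: the "nouns" and their descriptive attributes and hierarchies.
dim_store = pd.DataFrame([
    {"store_key": 1, "store_name": "Downtown", "city": "Austin", "state": "TX"},
    {"store_key": 2, "store_name": "Uptown",   "city": "Dallas", "state": "TX"},
])
dim_product = pd.DataFrame([
    {"prod_key": 10, "prod_name": "Widget", "category": "Hardware"},
    {"prod_key": 11, "prod_name": "Gadget", "category": "Electronics"},
])
dim_date = pd.DataFrame([
    {"date_key": 20140105, "month": "2014-01", "quarter": "2014-Q1"},
    {"date_key": 20140112, "month": "2014-01", "quarter": "2014-Q1"},
])

# Analysis queries simply join the fact table to whichever dimensions the question needs.
```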

Dimensional modeling was ideal for business users because it was designed with their analytic thinking in mind.  What do I mean by that?  For example, “By” analysis is a common way of capturing the reporting and analysis needs of the business users during the design phase of an analytic project.  The “By” analysis is consistent with the way business users tend to frame their data requests (a query sketch follows the list below), such as:

  • I want to see sales and returns by month, store, geography, product category, product type and customer segment
  • I want to compare this month and last month’s claims by policy type, policy rider, zip code, city, state, region, customer tenure and customer age
  • I want to see social sentiment scores trended by social media source, subject, and day of week
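Translating the first request above into a query against a star schema amounts to “join the fact table to the dimensions, then group by the requested attributes.” A minimal, self-contained pandas sketch (reusing the same illustrative table shapes as the previous sketch; in a BI tool this would be a SQL query generated behind the scenes):

```python
import pandas as pd

# Small illustrative star schema.
fact_sales = pd.DataFrame({
    "date_key":  [20140105, 20140105, 20140112],
    "store_key": [1, 2, 1],
    "prod_key":  [10, 11, 11],
    "sales_amt": [250.0, 90.0, 45.0],
})
dim_store   = pd.DataFrame({"store_key": [1, 2], "store_name": ["Downtown", "Uptown"]})
dim_product = pd.DataFrame({"prod_key": [10, 11], "category": ["Hardware", "Electronics"]})
dim_date    = pd.DataFrame({"date_key": [20140105, 20140112], "month": ["2014-01", "2014-01"]})

# "Sales BY month, store and product category" = join facts to dimensions, then group by.
sales_by = (
    fact_sales
    .merge(dim_date, on="date_key")
    .merge(dim_store, on="store_key")
    .merge(dim_product, on="prod_key")
    .groupby(["month", "store_name", "category"], as_index=False)["sales_amt"]
    .sum()
)
print(sales_by)
```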

Today, virtually all Business Intelligence (BI) tools use dimensional modeling as the standard way of interacting with the data warehouse.

Data Modeling For Data Science

In the world of data science using Hadoop, we again need to think differently about how we do data modeling.  Hadoop was developed at Yahoo, building on work published by Google, to deal with very long, flat web logs (see Figure 3).  Heck, Google called its comparable storage system “Bigtable[1]” because it was an uber-large table, not a series of smaller tables tied together with joins – it was just designed differently.

Figure 3: Sample Log File

For example, Hadoop accesses data in very large blocks – the default block size is 64MB to 128MB, versus relational database block sizes that are typically 32KB or less.  To take advantage of these large blocks, the data science team wants very long, flat records.
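Some back-of-the-envelope arithmetic (with made-up but representative numbers) shows why that access pattern matters: scanning the same data in a handful of large blocks means far fewer reads than fetching it in many small ones.

```python
# Rough illustration of why big blocks favor long, flat records (numbers are illustrative).
DATASET_BYTES = 1 * 1024**3           # 1 GB of log/fact data

hdfs_block  = 128 * 1024**2           # 128 MB Hadoop (HDFS) block
rdbms_block = 32 * 1024               # 32 KB relational database block

print("HDFS blocks to scan: ", DATASET_BYTES // hdfs_block)    # 8 large sequential reads
print("RDBMS blocks to scan:", DATASET_BYTES // rdbms_block)   # 32,768 small reads
```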

For example, some of our data scientists prefer to “flatten” a star schema by collapsing or integrating the dimension tables that surround the fact table into a single, flat record, so that they can construct and execute more complex data queries without having to use joins (see Figure 4).

Figure 4: Flattening the Star Schema for Data Science Work on Hadoop
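A minimal pandas sketch of that flattening (illustrative tables again; on a real cluster this step might be done once in a tool like Hive or Spark, with the wide result written back to HDFS): every dimension’s attributes are merged into the fact rows so that downstream analysis needs no joins at all.

```python
import pandas as pd

# Small illustrative star schema.
fact_sales  = pd.DataFrame({"date_key": [20140105, 20140112], "store_key": [1, 1],
                            "prod_key": [10, 11], "sales_amt": [250.0, 45.0]})
dim_store   = pd.DataFrame({"store_key": [1], "store_name": ["Downtown"],
                            "city": ["Austin"], "state": ["TX"]})
dim_product = pd.DataFrame({"prod_key": [10, 11], "prod_name": ["Widget", "Gadget"],
                            "category": ["Hardware", "Electronics"]})
dim_date    = pd.DataFrame({"date_key": [20140105, 20140112],
                            "month": ["2014-01", "2014-01"], "quarter": ["2014-Q1", "2014-Q1"]})

# Flatten: fold every dimension's attributes into the fact rows.
flat = (fact_sales
        .merge(dim_date, on="date_key")
        .merge(dim_store, on="store_key")
        .merge(dim_product, on="prod_key"))

# One wide, join-free record per fact row, ready to land as a long, flat file on Hadoop.
print(flat.columns.tolist())
```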

Taking this to the next level, the data science team will create an uber-long record for each different business function that can be more easily analyzed using Hadoop (see Figure 5).

Figure 5: Data Science Data Modeling on Hadoop

For example, we could have the following massively long records for an individual whom we want to analyze (a sketch of assembling such a record follows the list):

  • Customer demographics (age, gender, current and previous home addresses, value of current and previous home, history of marital status, kids and their ages and genders, current and previous income, etc.)
  • Customer purchase history (annual purchases including items purchased, returns, prices paid, discounts, coupons, location, day of week, time of day, weather condition, temperatures)
  • Customer social activities (entire history of social media posts, likes, shares, tweets, favorites, retweets, etc.)
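One way to picture such a record, as a hedged sketch with made-up tables and column names (at scale this would typically run in a tool like Hive or Spark rather than pandas): start from separate event-level tables and collapse each customer’s history into a single wide row, with list-valued columns holding the full histories.

```python
import pandas as pd

# Illustrative event-level tables, one per business function.
purchases = pd.DataFrame({
    "cust_id": ["C1", "C1", "C2"],
    "item":    ["shoes", "hat", "jacket"],
    "price":   [59.99, 19.99, 120.00],
})
social = pd.DataFrame({
    "cust_id": ["C1", "C2", "C2"],
    "post":    ["great service!", "love the new store", "long checkout line"],
})

# Collapse each function's history into list-valued columns per customer...
purchase_history = purchases.groupby("cust_id").agg(
    items=("item", list), prices=("price", list)
)
social_history = social.groupby("cust_id").agg(posts=("post", list))

# ...then join the per-function histories into one very wide record per customer.
customer_record = purchase_history.join(social_history, how="outer").reset_index()
print(customer_record)
```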

One technique that Dr. Pedro Desouza, the head of our data science practice at EMC Global Services, uses to avoid frequent, costly joins is to replicate just the key dimensional attributes into the fact table. In this way, he keeps the clear representation of the star schema but eliminates the joins by performing the heavy-lifting analysis just on the flat file. The tradeoff is a little bit of data redundancy in exchange for clarity, while taking advantage of Hadoop’s performance characteristics.
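A minimal sketch of that trade-off (illustrative column names only, not Dr. Desouza’s actual implementation): copy only the handful of dimensional attributes the analysis actually filters or groups on into the fact table, and leave the full dimension tables intact for reference.

```python
import pandas as pd

fact_sales  = pd.DataFrame({"store_key": [1, 2, 1], "prod_key": [10, 11, 11],
                            "sales_amt": [250.0, 90.0, 45.0]})
dim_store   = pd.DataFrame({"store_key": [1, 2], "store_name": ["Downtown", "Uptown"],
                            "state": ["TX", "TX"], "mgr_name": ["Ann", "Bo"]})
dim_product = pd.DataFrame({"prod_key": [10, 11], "category": ["Hardware", "Electronics"],
                            "prod_name": ["Widget", "Gadget"], "supplier": ["S1", "S2"]})

# Replicate only the key attributes the analysis needs: a little redundancy, no joins later.
fact_wide = (fact_sales
             .merge(dim_store[["store_key", "state"]], on="store_key")
             .merge(dim_product[["prod_key", "category"]], on="prod_key"))

# The heavy-lifting analysis now runs on the flat fact table alone.
print(fact_wide.groupby(["state", "category"])["sales_amt"].sum())
```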

What Do We Do?

My friend (and data scientist) Wei Lin calls this data modeling transition from relational (third normal form) to dimensional (star schema) to universal “The Good, The Bad and The Ugly” (I always had a preference for “Angel Eyes” in that movie, if you know which character that is).  But maybe that’s the wrong title? Maybe the name should be “The Complex, The Elegant and The Simple,” which better reflects the changing nature of data modeling.

As Xander Cage says, we need to think differently about how we approach data modeling in a big data world using Hadoop.  Old data modeling techniques that may have been optimal in a world of constrained storage and processing power are no longer the optimal techniques in a big data world.

We need to embrace the “think differently” mentality as we continue to explore and learn from this brave new world.  It may be a painful transition, but we made the successful transition from third normal form to dimensional modeling…eventually.

What Does This Mean?

It means that the way we designed data models for OLTP applications (using third normal form) and for data warehousing (using dimensional modeling) needs to change to take advantage of the inherent architecture and processing advantages offered by Hadoop.  Data scientists will create flat data models that take advantage of the “big table” nature of Hadoop to handle massive volumes of raw, as-is data.
