Combining PowerPivot and SSAS Cubes Effectively is Key to Analyzing Large Datasets

One of the key questions raised when I demo PowerPivot is, “Do I still need a data warehouse?” or “Do I still need a cube?” (see my previous blog post on when you still need a cube).

Here is an example of where using PowerPivot and SSAS together provides an interesting approach to answering this question.

I have a reasonably sized data warehouse model running on my laptop for the purpose of testing and illustrating concepts with SQL Server, SSAS, and the latest Power BI tools from Microsoft.  It is based on 122 million records published by the US government covering every airline delay over the past couple of decades.  The raw data is available here.

With this size of data, loading this straight into PowerPivot isn’t really going to work (see my previous blog post where I attempted to load 122 million records straight into PowerPivot).

So how do we combine the processing power of cubes and a real SQL database with the versatility of PowerPivot and PowerView?  Here are some strategies…

Understanding the Underlying Data Model

The airline delays data is ideally suited for classic dimensional modeling.  The data is essentially one big fact table with key measures relating to airline delays.  Here is the table I created to store this data:

[image: fact table definition in SQL Server]

In addition, I created a set of dimension tables to store the various lookup values that are needed by the model, including time, date, airline numbers, airport names, etc.
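As a rough sketch, the star schema looks something like the following.  The table and column names here are illustrative only, not the exact ones from my model:

```sql
-- Hypothetical star schema sketch for the airline delays model.
-- Table and column names are illustrative, not the actual model's.
CREATE TABLE DimAirline (
    AirlineKey   INT IDENTITY(1,1) PRIMARY KEY,
    CarrierCode  NVARCHAR(10)  NOT NULL,
    CarrierName  NVARCHAR(100) NOT NULL
);

CREATE TABLE FactAirlineDelay (
    DateKey          INT NOT NULL,  -- FK to a DimDate table
    TimeKey          INT NOT NULL,  -- FK to a DimTime table (hour of day)
    AirlineKey       INT NOT NULL,  -- FK to DimAirline
    OriginAirportKey INT NOT NULL,  -- FK to a DimAirport table
    DepartureDelay   INT NOT NULL,  -- minutes
    ArrivalDelay     INT NOT NULL   -- minutes
);
```

The fact table holds only surrogate keys and additive measures, which is what lets the cube aggregate 122 million rows efficiently.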

Loading Up the Data

The data available is a set of large CSV files.  I created a staging table to load the raw data into SQL Server, and then a set of SSIS packages to populate first the dimension tables and then the large fact table.

One of the key things SSIS provides over PowerPivot is a proper ETL tool, and in this case it was absolutely vital for processing the data properly.  I found lots of small issues when loading such a large dataset: values that needed converting to the right types, null values that needed mapping to more appropriate entries in the model, bad records that had to be removed, records where the plane model wasn’t available that had to be set to “not available”, and so on.  While PowerPivot is a great tool for loading clean data, when processing large files from suspect sources you really need a true ETL tool.
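A typical cleansing step from staging into the model might look like the following.  The table and column names are hypothetical, but the pattern (type conversion, null handling, dropping unusable rows) is what the SSIS packages implement:

```sql
-- Hypothetical cleansing query from the raw staging table into the fact table.
-- TRY_CONVERT (SQL Server 2012+) turns malformed numbers into NULLs,
-- which ISNULL then maps to a sensible default.
INSERT INTO FactAirlineDelay (DateKey, AirlineKey, DepartureDelay, ArrivalDelay, PlaneModel)
SELECT
    d.DateKey,
    a.AirlineKey,
    ISNULL(TRY_CONVERT(INT, s.DepDelay), 0),   -- bad/missing delay -> 0
    ISNULL(TRY_CONVERT(INT, s.ArrDelay), 0),
    ISNULL(NULLIF(LTRIM(RTRIM(s.PlaneModel)), ''), 'not available')  -- blank -> "not available"
FROM StagingAirlineDelay AS s
JOIN DimDate    AS d ON d.FlightDate  = s.FlightDate
JOIN DimAirline AS a ON a.CarrierCode = s.Carrier
WHERE s.FlightDate IS NOT NULL;                -- drop rows we can't key
```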

Creating the Cube

I created the cube model using Visual Studio 2012 and loaded up several years’ worth of data.  Once my cube processed successfully, I was able to access it in Excel.

Using Excel PowerPivot and Cubes Together

PowerPivot can use an SSAS cube as a data source.  The query is an MDX query that you can design through a built-in query builder.

[image: MDX query designer in PowerPivot]

So here is the key idea to using PowerPivot and cubes together: choose the right dimensions and level of granularity you need for your PowerPivot query.

In this case, we have 122 million records organized by dimensions such as time, airline, and plane model.  We need to be a little selective about how we create the query so that we only fetch aggregated values instead of every record.

As an example, imagine we want to create a PowerView or PivotChart that displays the average delay for each hour of the day.  While we have a large number of rows, if we pick our slices judiciously we can limit the rows pulled into our PowerPivot model to the appropriate level of granularity that we need for useful business analysis.

For example, I built a query that fetched departure delay and arrival delay by hour, month, and carrier for the years 1988 to and including 1994.  That slice spans roughly 35 million underlying records, but because the cube summarizes at that grain, each row returned is the aggregate of many individual flights.  The resulting query is about 32K rows, which is a much more reasonable size to refresh frequently.
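The generated MDX looks roughly like this.  The dimension, hierarchy, and measure names are assumptions for illustration, not the exact names from my cube:

```
-- Sketch of the MDX the query builder produces; names are illustrative.
SELECT
  { [Measures].[Departure Delay],
    [Measures].[Arrival Delay],
    [Measures].[Number of Flights] } ON COLUMNS,
  NON EMPTY
      [Time].[Hour].[Hour].MEMBERS
    * [Date].[Month].[Month].MEMBERS
    * [Airline].[Carrier].[Carrier].MEMBERS ON ROWS
FROM [Airline Delays]
WHERE ( { [Date].[Year].&[1988] : [Date].[Year].&[1994] } )
```

The crossjoin on ROWS defines the grain; the WHERE slicer restricts the years without adding them to the result set.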

[image: query results loaded into PowerPivot]

This takes seconds to refresh across the entire cube.
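To see why the result set stays small, note that the row count is bounded by the product of the member counts at the chosen grain, not by the 122 million underlying fact rows.  Assuming roughly 16 carriers (an assumption; the exact count depends on the data), the arithmetic works out like this:

```python
# Upper bound on rows returned at the chosen grain:
# hours x months x years x carriers -- independent of the 122M fact rows.
hours = 24
months = 12
years = 7        # a seven-year span of data
carriers = 16    # assumed carrier count; the actual number depends on the data

max_rows = hours * months * years * carriers
print(max_rows)  # 32256 -- on the order of the ~32K rows observed
```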

Using the Data in PowerView to Answer Questions

Based on our PowerPivot model, we can create calculated measures for average arrival delay and average departure delay across our 32K records.

  • Average Arrival Delay:=SUM([Arrival Delay])/SUM([Number of Flights])
  • Average Departure Delay:=SUM([Departure Delay])/SUM([Number of Flights])
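These measures divide the summed delay by the summed flight count rather than averaging each row’s own average, because the 32K rows are pre-aggregated and an average of averages would weight every row equally regardless of how many flights it represents.  A small sketch of the difference, with made-up numbers:

```python
# Two pre-aggregated rows (e.g. two carrier/hour slices):
# (total arrival delay in minutes, number of flights)
rows = [(100, 10), (40, 2)]

# Correct: weight by flights, as the DAX measures do -
# sum the delays, sum the flights, then divide.
weighted = sum(d for d, n in rows) / sum(n for d, n in rows)

# Wrong: averaging each row's own average treats both rows equally,
# even though one row represents five times as many flights.
avg_of_avgs = sum(d / n for d, n in rows) / len(rows)

print(weighted)     # 11.666... (140 minutes / 12 flights)
print(avg_of_avgs)  # 15.0      ((10 + 20) / 2)
```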

So based on this dataset, we can use PowerView to create a nice-looking and informative dashboard.

[images: PowerView dashboard views]