Excel 2013 is an amazing tool in terms of its increased scalability and its ability to handle large data sets. Combined with Power Pivot, Power View, and Power Query, Excel 2013 is now Microsoft’s leading tool for business intelligence.
What is remarkable is how much data Excel 2013 can process in memory – there is no need for cubes, data warehouses, or centralized servers if you can simply load large data files straight into Excel.
When we hear about “Big Data”, how big is big? Just using Excel 2013 and some raw CSV files, I can load quite a lot of data.
14 Million Records
Earlier I wrote a blog post where I loaded approximately fourteen million records from the City of Toronto – every parking ticket issued in the past several years. With this dataset loaded into Power Pivot, I can use Power View to chart the volume of parking tickets by day of the week, by year, or by month.
Loading 14 million records in Excel 2013 takes about 3 minutes to completely refresh from the original data source in Power Pivot. Once the data is loaded, Power View refreshes quickly enough to be highly usable.
122 Million Records!
In looking for large data sets, I found the publicly available record of every airline delay in the United States from 1998 to 2008, published by the Bureau of Transportation Statistics. I downloaded all of the CSV files and dropped them into a single directory. There are approximately 122 million records and over 11 gigabytes of data across the entire set of files.
Historically, this dataset has been treated as “Big Data”: it has been used for testing generic statistical analysis techniques as well as for trying to optimize the prediction of delays based on factors such as weather and the origin of the flight.
As an experiment, I tried loading the entire dataset into Excel 2013. To load multiple files into Power Query, you can choose a folder location and pick up all the files in that folder.
If you then click on the expand arrows to the right of the Content column, you can select the actual contents of the files to be imported instead of just the file metadata.
You can then Apply and Close and let the import begin!
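For readers outside Excel, the “From Folder” import described above can be sketched in pandas – this is a rough analogue of what Power Query does, not its internal mechanics, and the folder name is purely illustrative:

```python
import glob
import os

import pandas as pd


def load_folder(data_dir):
    """Read every CSV file in data_dir and stack them into one table --
    the pandas analogue of Power Query's folder import expanding the
    Content column into actual rows."""
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Pointing this at a directory of the downloaded BTS files would concatenate all of them into a single frame, just as Power Query combines the files in the chosen folder.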
Here are some interesting performance stats while my machine was chugging away:
As illustrated by the graphs, performance depends on two key resources: 1) available RAM and 2) available disk I/O. The CPU is barely taxed. This makes intuitive sense – essentially this is a very large loading job from disk into RAM.
The result: it worked – I was able to import 122 million records into Excel. Here is my Power Pivot table with 122 million records! From beginning to end, it took approximately 1 hour to load the entire data set into memory.
Is the Data Usable?
I was able to load the data, but is it usable? While far from exhaustive, here are a few basic tests I tried to determine whether Excel could do some meaningful reporting on the dataset.
Loading and Saving the Excel File
After loading up all this data, could I actually save the Excel file and could I reload the file from disk?
Saving the file to disk was quite fast – it took about 2 minutes. The file itself is about 2.2 gigabytes, so there must be some compression going on, because the original files totalled more than 11 gigabytes.
Loading the file was very fast – it took seconds. However, when I tried again later, after my computer had gone to sleep and I restarted Excel, it was significantly slower – about 2-3 minutes.
Linking Data to Other Tables
The original dataset also supplied three CSV files with data for Airports, Carriers, and Planes. I loaded each of these into my model, and in addition I added a Date table from the Azure Marketplace. I linked these to the main table as relationships in my Power Pivot model.
Most of these were quite quick to create – within a minute.
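A Power Pivot relationship behaves much like a left join from the fact table to a lookup table. A minimal pandas sketch, using made-up rows (the column names here are illustrative, not the exact BTS headers):

```python
import pandas as pd

# Toy stand-ins for the main flight table and the Carriers lookup file.
flights = pd.DataFrame({
    "Carrier": ["AA", "WN", "AA"],
    "DepDelay": [5, 12, -3],
})
carriers = pd.DataFrame({
    "Code": ["AA", "WN"],
    "Description": ["American Airlines", "Southwest Airlines"],
})

# Relating the tables: every flight row picks up its carrier's description,
# the way a Power Pivot relationship lets a lookup column flow into reports.
linked = flights.merge(carriers, left_on="Carrier", right_on="Code", how="left")
```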
Create a Calculated Column
The Airline Delays dataset provides columns for Year, Month, DayOfMonth, and DayOfWeek. I created a calculated column to convert the month number to the month name (e.g. 2 = February), and another to extract the hour from the Departure Time column.
Running these basic calculations was pretty fast – about a minute.
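The two calculated columns can be sketched in pandas as follows. This assumes the departure time is stored as an hhmm integer (e.g. 1430 for 2:30 PM), which is how the BTS files encode it – treat that as an assumption if your copy of the data differs:

```python
import calendar

import pandas as pd

# Small illustrative sample of the relevant columns.
df = pd.DataFrame({"Month": [2, 7, 12], "DepTime": [630, 1430, 2359]})

# Month number -> month name (2 = February), like the DAX calculated column.
df["MonthName"] = df["Month"].map(lambda m: calendar.month_name[m])

# Extract the hour from the hhmm departure time via integer division.
df["DepHour"] = df["DepTime"] // 100
```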
Creating a Power View
I tried to create a Power View sheet with the Delay Hour field and the DepDelay field to show how departure delay minutes relate to the time of day. This was VERY slow – it took minutes to refresh the data every time I made a change.
This is where a pre-aggregated cube in SQL Server would make things much faster – aggregating everything dynamically is slow at volumes this large.
But it works!
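The aggregation Power View recomputes on every change is essentially a group-by over all 122 million rows. A toy pandas version of the same calculation (the five sample rows are made up):

```python
import pandas as pd

# Toy flight rows; the real dataset has 122 million of these, which is
# why recomputing this on every chart change takes minutes.
df = pd.DataFrame({
    "DepHour": [6, 6, 14, 14, 23],
    "DepDelay": [2, 4, 10, 20, 60],
})

# Average departure delay per hour of day -- the series behind the chart.
avg_delay = df.groupby("DepHour")["DepDelay"].mean()
```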
Conclusion: Pushing the Limits of an In-Memory Based Database
14 million records was usable – I could manipulate the data, save the spreadsheet, and load it into a Power View, and it all worked quite well. It’s not fast, but it’s usable. I could imagine doing analysis against this data pretty easily, aggregating against dates to filter the data by year, month, or day.
At 122 million records, Power Pivot shows its limits. Because everything is loaded into memory dynamically with no pre-indexing, once my laptop uses up its available RAM, processing slows down considerably. Working with Power View also becomes challenging because every time you change the configuration of a chart or view, it takes minutes to recalculate the result.