Experimenting with SQL Azure as a Staging Database for a Data Warehouse

In a typical EDW, one of the biggest infrastructure costs is the ETL and staging databases used to crunch incoming data before turning it into a cube. In large data warehouses, the staging database ends up consuming a significant amount of storage and requires dedicated database servers to manage it, which adds up to significant costs in both infrastructure and licenses.

So what if we could move all of this processing to the cloud and maximize the use of SQL Azure to act as our staging area?  

Here is our on-premises architecture:

[Image: on-premises architecture diagram]

Running this architecture on my laptop means everything is on a single machine, and processing is quite slow. Scaled out to a production infrastructure, it could deliver reasonable performance processing the 122 million rows in the dataset.

But what if we moved our SQL database to SQL Azure, and what if we moved our cube and SSIS processing to an Azure VM?

The Potential of a PaaS SQL Azure Based Approach

SQL Azure is a PaaS service, which means you don't have to pay for a SQL cluster, servers, etc. Instead, you pay per gigabyte per month for what is stored in your SQL Azure database.

SQL Azure is cheap (although not as cheap as Table or Blob storage) – roughly $50 a month for 10 gigabytes and $132 a month for 50 gigabytes. Compare that to the hundreds to thousands of dollars a month for a SQL Server VM running in Azure, and it is significantly cheaper.

In addition to the cost difference, there is also the advantage that your database is spread across many individual servers with lots of available I/O for processing, without the need to buy higher-performance storage.

The other potential advantage is the ability to scale up ETL VMs for heavy processing loads and then either scale them down or shut them down entirely when they are not needed. For example, imagine we have 10 million records arriving once a month that need to be fed into our warehouse. In the traditional on-premises world, we would need ETL servers sitting idle for most of the month except for the days when new files show up. In the cloud approach, we could save costs by only spinning up our VMs when needed and only paying for the work on demand, as in the sketch below.
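
To make that on-demand pattern concrete, here is a minimal sketch of starting the ETL VM for the monthly load and deallocating it afterwards. It assumes the azure-identity and azure-mgmt-compute Python packages; the subscription ID, resource group, and VM names are hypothetical placeholders rather than anything from my actual setup.

```python
# Minimal sketch of the on-demand ETL pattern: start the ETL VM only when new
# files arrive, then deallocate it (so we stop paying for compute) when the
# load finishes. Subscription, resource group, and VM names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "etl-rg"        # hypothetical resource group
VM_NAME = "etl-ssis-vm"          # hypothetical ETL/SSIS VM

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def run_monthly_load(process_files):
    """Start the ETL VM, run the load, then deallocate the VM."""
    compute.virtual_machines.begin_start(RESOURCE_GROUP, VM_NAME).result()
    try:
        process_files()  # kick off the SSIS packages / ETL job here
    finally:
        # Deallocate (not just power off) so the VM stops accruing compute charges.
        compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, VM_NAME).result()
```

The point is simply that the ETL compute exists only for the hours it is actually doing work.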

Redefining Our Architecture to Use SQL Azure

In order to test this approach, I set up a cloud based architecture that looks like this:

[Image: cloud-based architecture diagram – SQL Azure staging database with an Azure VM running SSIS and SSAS]

So our staging database is now SQL Azure, and we have an Azure VM running SSIS and SQL Server for performing the ETL and cube generation.

Moving to the Cloud

Porting the existing architecture to the new one was surprisingly easy. I took the original on-premises SQL schema, ran it against the SQL Azure database I had created, and I had a working staging database in SQL Azure.
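
For anyone who would rather script that step than use the SQL tooling, here is a rough sketch of applying a schema script to a SQL Azure database. The server, database, credentials, and file name are placeholders, and it assumes pyodbc plus an installed SQL Server ODBC driver, not anything specific to my setup.

```python
# Apply a T-SQL schema script to the SQL Azure staging database.
# Server, database, credentials, and script name are hypothetical placeholders.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<yourserver>.database.windows.net,1433;"
    "Database=StagingDB;"
    "Uid=<user>;Pwd=<password>;"
    "Encrypt=yes;"
)

with open("staging_schema.sql") as f:
    # Naive split on GO batch separators, which the ODBC driver does not understand.
    batches = [b.strip() for b in f.read().split("\nGO") if b.strip()]

conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()
for batch in batches:
    cursor.execute(batch)
conn.close()
```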

I took my original Visual Studio SSIS project and copied it up to my new Azure VM.  I reconfigured the data sources to point at the new SQL Azure DB and everything worked like a charm! 

I then pointed the SSAS project's data source at the SQL Azure DB and set the target to be a new cube on the Azure VM. Again, this was easy to do, and with a few records in the DB I was able to confirm the cube built correctly!

Lessons Learned In Processing Large Volumes of Records

How Fast Is It?

In my current tests, we're running about 6-7 million records per hour, which isn't terribly fast for a production environment. At that rate, loading the full 122 million row dataset would take roughly 17-20 hours.

In setting up this little test, I learned a few lessons about pushing data through the ETL process.


The Bottleneck Is Now Network, Not I/O

One of the key bottlenecks in ETL processing is I/O, which is why SQL DBAs are continually focused on improving disk performance, in-memory caching, etc. However, when I run the same ETL job against SQL Azure, I/O is no longer the bottleneck – it's the network.

Here is the DTS Debug Host slamming records into the DB as fast as it can go:

[Image: CPU utilization graph during the SSIS load]

As you can see, it barely registers from a CPU perspective.  Here is the memory profile:

[Image: memory utilization graph during the SSIS load]

Again, we’re barely registering in terms of memory usage.  Now let’s look at network:

[Image: network utilization graph during the SSIS load]

As you can see from the graph, the network is getting slammed as SSIS shoves records into SQL Azure.

Is SSIS Cloud Optimized? 

We have just taken regular SSIS and run it against a cloud-based database. However, it's not clear that SSIS uses any different strategy for inserting records into a SQL Azure DB than it does for a local database. It's also not clear whether we're maxing out the network or whether SSIS is waiting around on connection handshakes, round trips, etc. and therefore not sending as much data over the wire as it could. The batching sketch below is one way to start answering that.
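
One way to get a feel for this, independent of SSIS, is to push the same rows over the same link with different batch sizes and compare throughput; if bigger batches are dramatically faster, the cost is round trips rather than raw bandwidth. A rough sketch, assuming pyodbc and a hypothetical StagingTest table:

```python
# Rough experiment: time the same inserts at different batch sizes to see how
# much of the slowness is per-round-trip overhead rather than raw bandwidth.
# The connection string and StagingTest table are hypothetical placeholders.
import time
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<yourserver>.database.windows.net,1433;"
    "Database=StagingDB;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
ROWS = [(i, f"value-{i}") for i in range(100_000)]  # synthetic test rows

def timed_load(batch_size: int) -> float:
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    cursor = conn.cursor()
    cursor.execute("TRUNCATE TABLE StagingTest")  # start each run from empty
    conn.commit()
    cursor.fast_executemany = True  # pack parameters into bulk arrays client-side
    start = time.time()
    for i in range(0, len(ROWS), batch_size):
        cursor.executemany(
            "INSERT INTO StagingTest (Id, Payload) VALUES (?, ?)",
            ROWS[i:i + batch_size],
        )
        conn.commit()  # one transaction (and round trip) per batch
    elapsed = time.time() - start
    conn.close()
    return elapsed

for size in (100, 1_000, 10_000):
    print(f"batch size {size}: {timed_load(size):.1f}s")
```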

We May Need a Better ETL Tool

We may need a custom ETL script or tool that is Azure-aware and can maximize the available resources to push data into the database as quickly as possible. For example, I wrote a custom import tool for Azure Table Storage that takes advantage of transactions, multiple threads of execution, asynchronous calls to the database, etc. This is something we'll look at in a future proof of concept; a rough sketch of the idea follows.
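
To show the shape of what I mean, here is a minimal sketch of a parallel loader: several worker threads, each with its own connection, pushing large batched inserts so the single-stream network bottleneck is spread across multiple streams. The connection string, table, and columns are hypothetical placeholders, it assumes pyodbc, and it is not the Table Storage tool mentioned above.

```python
# Sketch of an "Azure-aware" loader: several worker threads, each with its own
# connection, pushing large batched inserts in parallel so more of the
# available network stays busy. Connection string, table, and columns are
# hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<yourserver>.database.windows.net,1433;"
    "Database=StagingDB;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
BATCH_SIZE = 5_000
WORKERS = 8

def load_batch(rows):
    """Insert one batch of rows on a dedicated connection."""
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    try:
        cursor = conn.cursor()
        cursor.fast_executemany = True
        cursor.executemany(
            "INSERT INTO StagingFact (Id, Amount, LoadDate) VALUES (?, ?, ?)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()

def load_all(rows):
    """Split the incoming rows into batches and load them in parallel."""
    batches = [rows[i:i + BATCH_SIZE] for i in range(0, len(rows), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(load_batch, batches))  # iterate results to surface errors
```

The idea is simply that many moderate-sized, parallel batches keep more of the available network busy than one serial stream of inserts.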

Based on the numbers above, it's not clear whether the current tooling is fast enough. However, this needs some further experimentation… stay tuned.