New Hybrid Connection Manager Enables Integration with Azure and Your Internal Network

Microsoft has unveiled its latest Azure feature: the Hybrid Connection.


Hybrid Connections allow your Azure web site or mobile service to connect back into your internal network without the need for a VPN or opening up your firewall.  Hybrid Connections support connections to HTTP Web APIs, Web Services and databases such as SQL Server or MySQL.

Essentially, the Hybrid Connection Manager acts as a proxy client in your network to connect OUT to Azure over port 5671 instead of having your web site connect IN to your network.  The Hybrid Connection Manager sounds a lot like the Office 365 Data Management Gateway.
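
As a rough sketch of what this enables (the server, database, and credential names below are hypothetical), code running in your Azure web site keeps using the internal host name in its connection string, and the Hybrid Connection Manager on your network relays the traffic:

```python
# Hypothetical sketch: an Azure web site reaching an on-premises SQL Server
# through a Hybrid Connection. The connection string uses the *internal*
# machine name; the Hybrid Connection Manager relays the traffic out to Azure.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};"
    "SERVER=INTERNAL-SQL01;"   # internal host name, not a public endpoint
    "DATABASE=OrdersDb;"       # hypothetical database
    "UID=app_user;PWD=...;"
)
for row in conn.execute("SELECT TOP 5 OrderId, OrderTotal FROM dbo.Orders"):
    print(row.OrderId, row.OrderTotal)
```

The important point is that nothing in the application changes other than treating the on-premises server name as if it were local.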

There is a full walkthrough of connecting an Azure Web Site to an internal SQL DB here.

Read More

Microsoft Unveils New Features to Mature Your Enterprise Cloud

As enterprises embrace the cloud, there is an increasing demand for enterprise class connectivity, hybrid data center scenarios and increased network flexibility between various cloud networks, internal networks, etc.  Microsoft has announced a number of features designed to improve the robustness, flexibility and security of your enterprise hybrid cloud networks.

Azure ExpressRoute

Through partnerships with AT&T, BT, Equinix, Level3, SingTel, TelecityGroup and Verizon (with more partners presumably rolling out in the future), ExpressRoute provides a dedicated pipe between these partner data centers and Azure instead of routing traffic over the Internet.  This improves reliability, redundancy and the overall bandwidth available for hybrid scenarios that depend on moving a lot of data from your internal data center to Azure and back again.

[Diagram: ExpressRoute connectivity overview]

Note that ExpressRoute only works with IaaS; you cannot, for example, use it to connect to SQL Azure.

VPN Improvements

The Azure Virtual Network now supports more than one site-to-site VPN, so you can create an Azure Virtual Network and connect it to multiple on-premises locations.  In addition, you can now link multiple virtual networks together in Azure.


Static IP Addresses

You can now reserve IP addresses and have them allocated as static IP addresses.  This is a key requirement for applications that don’t handle changing IP addresses well (SharePoint 2013 being a good example).

Read More

Scheduled Data Refresh from Power Query Now Available

Microsoft has just released an update to Power BI to allow for scheduled data refreshes from Power Query.

Although currently limited to SQL and Oracle databases, the intent is to extend this to additional data sources in the future.  As with the existing SQL and Oracle data connectors, you need to install the Data Management Gateway on your internal network; it acts as a proxy that lets Power BI connect to your internal data sources.


Read More

New Office Mix Online Courseware Tool Released in Preview

As blogged about previously, Microsoft is developing a new courseware development tool based on Office 365 and PowerPoint.  The preview version of Mix is now available to those who signed up for the release.


The key functionality available includes:


  • Recording yourself giving a presentation
  • Inserting quizzes
  • Screen recording
  • Inserting screenshots
  • Preview of your course
  • Uploading and sharing through Office 365
  • Exporting the entire presentation to video

There are already some files uploaded to the site that you can view as well as some help tutorials on how to create your own presentations using the tool.

Read More

Experimenting with SQL Azure as a Staging Database for a Data Warehouse

In a typical enterprise data warehouse (EDW), one of the biggest infrastructure costs is the ETL and staging databases used to crunch the incoming data before turning it into a cube.  In large data warehouses, the staging database ends up consuming a significant amount of storage and requires database servers to manage it, adding significant cost in terms of both infrastructure and licenses.

So what if we could move all of this processing to the cloud and maximize the use of SQL Azure to act as our staging area?  

Here is our on-premises architecture:

[Diagram: on-premises staging architecture]

Running this architecture on my laptop, I have everything on a single machine and processing is quite slow.  If I scaled it out to production infrastructure, I could get reasonable performance processing the 122 million rows in the dataset.

But instead, what if we moved our SQL database to SQL Azure, and what if we moved our cube and SSIS processing to an Azure VM?

The Potential of a PaaS SQL Azure-Based Approach

SQL Azure is a PaaS service, which means you don’t have to pay for a SQL cluster, servers, etc.  Instead, you pay per gigabyte per month for what is stored in your SQL Azure database.

SQL Azure is cheap (although not as cheap as Table or Blob storage): ~$50 for 10 gigabytes and ~$132 for 50 gigabytes.  If you compare that to the cost of a SQL VM, it is significantly cheaper than the hundreds to thousands of dollars a month for a SQL VM running in Azure.

In addition to the cost difference, there is also the advantage of your database being spread across many individual servers with lots of available IO for processing, without the need for higher performance storage.

The other potential advantage is the ability to scale up ETL VMs for heavy processing loads and then either scale them down or shut them down entirely when not needed.  For example, imagine we have 10 million records arriving once a month that need to be fed into our warehouse.  In the traditional on-premises world, we would need ETL servers sitting there idle for most of the month except for the days when new files show up.  In our cloud approach, we could save costs by only spinning up our VMs when needed and only paying for the work on demand.
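
To make that concrete, here is a back-of-the-envelope comparison; the hourly rate and the length of the monthly processing window are assumptions for illustration, not actual Azure pricing:

```python
# Rough cost comparison: always-on ETL VM vs. a VM spun up only for the
# monthly load. Both figures below are assumed for illustration.
hourly_rate = 0.60            # assumed VM cost per hour
always_on_hours = 24 * 30     # VM left running all month
on_demand_hours = 24 * 2      # assume the monthly load takes about two days

always_on_cost = hourly_rate * always_on_hours
on_demand_cost = hourly_rate * on_demand_hours

print(f"Always-on:  ${always_on_cost:,.2f} per month")   # $432.00
print(f"On-demand:  ${on_demand_cost:,.2f} per month")   # $28.80
```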

Redefining Our Architecture to Use SQL Azure

In order to test this approach, I set up a cloud-based architecture that looks like this:

[Diagram: cloud-based staging architecture]

So our staging database is now SQL Azure, and we have an Azure VM running SSIS and SQL Server for ETL and cube generation.

Moving to the Cloud

Porting the existing architecture to the new architecture was surprisingly easy.  I took the original on-premises SQL schema, ran it against the SQL Azure DB I had created, and I had a working staging database in SQL Azure.
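
A minimal sketch of that step, assuming a schema script exported from the on-premises database (the server, database, and credentials are placeholders; only the general SQL Azure connection string shape is real):

```python
# Sketch: replay the on-premises staging schema against a SQL Azure database.
# Server, database, and credential values are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};"
    "SERVER=tcp:myserver.database.windows.net,1433;"
    "DATABASE=StagingDb;"
    "UID=etl_user@myserver;PWD=...;"
    "Encrypt=yes;"
)
conn.autocommit = True

# schema.sql is the script generated from the on-premises staging database
with open("schema.sql") as f:
    for batch in f.read().split("\nGO\n"):   # split on T-SQL batch separators
        if batch.strip():
            conn.execute(batch)
```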

I took my original Visual Studio SSIS project and copied it up to my new Azure VM.  I reconfigured the data sources to point at the new SQL Azure DB and everything worked like a charm! 

I then pointed the SSAS project data source at the SQL Azure DB and set the target to be a new cube on my Azure VM.  Again, this was easy to do, and with a few records in the DB I was able to confirm the cube built correctly!

Lessons Learned In Processing Large Volumes of Records

How Fast is It?

In my current tests, we’re running about 6-7 million records per hour.  This isn’t terribly fast for a production environment. 

In setting up this little test, there are a few lessons learned in processing the data through the ETL process.


The Bottleneck is now Network, not I/O

One of the key bottlenecks in ETL processing is I/O, which is why SQL DBAs are continually focused on improving disk performance, in-memory caching, etc.  However, when I run the same ETL job against SQL Azure, I/O is no longer the bottleneck; it's the network.

Here is the DTS Debug Host slamming records into the DB as fast as it can go:

[Screenshot: CPU usage while the DTS Debug Host loads records]

As you can see, it barely registers from a CPU perspective.  Here is the memory profile:

[Screenshot: memory usage during the load]

Again, we’re barely registering in terms of memory usage.  Now let’s look at network:

[Screenshot: network utilization during the load]

As you can see by the graph, the network is getting slammed as SSIS shoves records into Azure SQL.

Is SSIS Cloud Optimized? 

We have simply taken regular SSIS and run it against a cloud-based database.  However, it’s not clear that SSIS uses any different strategies for inserting records into a SQL Azure DB than it does for a local database.  It’s also not clear whether we’re maxing out the network or whether SSIS is waiting around for handshakes, HTTP connections, etc., and therefore not sending as much data over the wire as it could.

We May Need a Better ETL Tool

We may need a custom ETL script or tool that is Azure aware that can maximize the available resources to push data into the database as quickly as possible.  For example, I wrote a custom import tool for Azure Table Storage that takes advantage of transactions, multiple threads of execution, asynchronous calls to the database, etc.  This is something we’ll look at in a future proof of concept.
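
A minimal sketch of the kind of loader described above, adapted here for SQL Azure rather than Table Storage (the table name, batch size, thread count, and connection details are assumptions; a real tool would also need retry logic for transient Azure errors):

```python
# Sketch of an "Azure aware" loader: batch the rows and push several batches
# in parallel so the network pipe stays full. All names and sizes below are
# assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor
import pyodbc

CONN_STR = ("DRIVER={SQL Server Native Client 11.0};"
            "SERVER=tcp:myserver.database.windows.net,1433;"
            "DATABASE=StagingDb;UID=etl_user@myserver;PWD=...;Encrypt=yes;")

def load_batch(rows):
    """Insert one batch of (flight_id, delay_minutes) tuples on its own connection."""
    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor()
    cursor.fast_executemany = True   # send the whole batch as a parameter array
    cursor.executemany(
        "INSERT INTO dbo.StagingFlights (FlightId, DelayMinutes) VALUES (?, ?)",
        rows)
    conn.commit()
    conn.close()

def load_parallel(all_rows, batch_size=5000, workers=8):
    """Split the rows into batches and load them on several threads at once."""
    batches = [all_rows[i:i + batch_size]
               for i in range(0, len(all_rows), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load_batch, batches))   # drain the iterator so errors surface
```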

Based on the numbers above, it’s not clear whether the current tooling is fast enough.  However, this needs some further experimentation…stay tuned.

Read More

New Features Available in the Latest Power Query Update

Microsoft has just released a new version of Power Query with some enhancements.  You can download the latest version of Power Query here.

For full details of all the new features, check out the business intelligence blog here.

New features include:

  • New Data Sources
    • Updated “Preview” functionality of the SAP BusinessObjects BI Universe connectivity
    • Access tables and named ranges in a workbook
  • Improvements to Query Load Settings
    • Customizable Defaults for Load Settings in the Options dialog
    • Automatic suggestion to load a query to the Data Model when it goes beyond the worksheet limit
    • Preserve data in the Data Model when you modify the Load to Worksheet setting of a query that is loaded to the Data Model
  • Improvements to Query Refresh behaviors in Excel
    • Preserve Custom Columns, Conditional Formatting and other customizations of worksheet tables
    • Preserve results from a previous query refresh when a new refresh attempt fails
  • New Transformations available in the Query Editor
    • Remove bottom rows
    • Fill up
    • New statistic operations in the Insert tab
  • Other Usability Improvements
    • Ability to reorder queries in the Workbook Queries pane
    • More discoverable way to cancel a preview refresh in the Query Editor
    • Keyboard support for navigation and rename in the Steps pane
    • Ability to view and copy errors in the Filter Column dropdown menu
    • Remove items directly from the Selection Well in the Navigator
    • Send a Frown for Service errors

Read More

Sometimes an Ordinary Pivot Table is Faster/Better than PowerPivot

One of the ways to interact with an SSAS cube is through an ordinary Excel Pivot Table.  You can do this with the following steps:

1. Insert Pivot Table


2. Select “Use an external data source”.


3. Choose or create a connection to your SQL SSAS cube.

Once you have established your connection, you can choose measures and dimensions from your cube to plot out your pivot table.

The Advantage of a Pivot Table Over PowerPivot

When PowerPivot fetches data from a cube, it has to pre-fetch all the possible cells to be used within the model as a local cache.  This results in a lot of rows being fetched, especially if there are a lot of dimensions, as the cells end up being all possible combinations of those dimensions.

In contrast, a pivot table fetches data only as needed.  This makes it much faster as you drill up and down various dimensions.  For example, I have a cube containing 35 million records of airline delays.  I can create a pivot table against this cube by connecting to it from Excel and inserting a pivot table.


The key difference between a pivot table and PowerPivot is that the pivot table fetches only the cells needed to display the current view instead of pre-fetching all possible cells.  This makes manipulating the cube’s measures and dimensions much faster, especially with large data sets.   Here is a pivot table that shows airline delays by hour:

[Screenshot: pivot table of airline delays by hour]

If I click on any of the hours, Excel immediately runs a query against the cube to fetch the cells needed to display the data:

[Screenshot: pivot table drill-down into a single hour]

Drilling is very fast because it’s actually only pulling 60 rows (one for each minute).  If we tried to do this in PowerPivot, it would need to pull in every row for every second to create a local data set, and this would be very slow.
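
The difference in scale is easy to illustrate with some rough arithmetic; the dimension sizes below are assumptions, not the actual cube design:

```python
# Illustrative cell-count comparison (dimension sizes are assumed).
hours, minutes, carriers, airports = 24, 60, 20, 300

# PowerPivot-style pre-fetch: every combination of the dimension members
prefetched = hours * minutes * carriers * airports
# Pivot table drill-down into a single hour: just the 60 minute-level rows
drilldown = minutes

print(f"Pre-fetched combinations: {prefetched:,}")   # 8,640,000
print(f"Rows for one drill-down:  {drilldown}")      # 60
```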

The Big Advantage – Refreshing the Data

In addition to being fast to create in the first place, the pivot table is also much faster to refresh.  Again, the reason is that Excel fetches only the data it needs instead of trying to create a local in-memory cache of all possible records.  The refresh on my pivot table takes less than 1 second for 35 million records!

The Big Limitation – PowerView only works with PowerPivot

The big limitation of this approach is that PowerView only works with PowerPivot.  It will not allow you to use an ordinary pivot table as a data source.  You are limited to creating dashboards using Pivot Charts.


Read More