What is Machine Learning and Predictive Analytics? A Real World Example

Azure Machine Learning is Microsoft’s machine learning studio.  It provides a workbench for analysts to perform data analysis including applying predictive analytics and machine learning algorithms.  

One of the key uses of Machine Learning is finding correlations in data and using the relationships between different indicators to provide predictive power.  Here is an example scenario I built in Azure ML.

The Scenario

I found a dataset that describes a set of Community Health Status Indicators by county for the United States.  It provides a set of health rates such as homicide, cancer, obesity, suicide, etc.  In addition, it provides a set of demographic indicators such as the size of the county, the population density, the poverty rate and the population breakdown by race.

Creating a Dashboard

I created a Power BI Dashboard that summarizes some of the key indicators. 

image

While this is an interesting dashboard, it doesn’t tell us what factors influence key metrics like Average Life Expectancy.  With dozens of potential indicators, it’s not clear which ones are key drivers and which are less important.

Finding Predictors with Azure ML

What if we could determine the indicators that predict Average Life Expectancy?  We could then understand the factors that impact this key metric and put them on our dashboard. 

Using Excel, I combined the separate indicator files into a single CSV file containing all of the possible indicators.  I then loaded this file into Azure ML Studio.

image

I then used the Project Columns module in Azure ML to pick out a number of columns that could potentially impact Average Life Expectancy.

image

Which Features Have Predictive Power?

Azure ML provides a number of methods for analyzing features and determining which of them have a strong predictive relationship with the indicator you are trying to predict. 

One of the modules in Azure ML Studio is Filter Based Feature Selection, which reduces the number of columns based on statistical analysis.  You set the target for your prediction (in this case Average Life Expectancy) and the module goes through your list of features and finds the ones with the strongest correlations to that target.

In reviewing the output, here are some of the features that have the strongest correlation with Average Life Expectancy.

image

ALE (Average Life Expectancy) is obviously the top feature since it is the one we’re trying to predict.  Features such as the number of people under 18, the poverty rate, lung cancer rate and so on seem to be the best candidates for predicting Average Life Expectancy.
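As a rough illustration of the correlation-based ranking described above, the sketch below scores a few columns by the absolute value of their Pearson correlation with the target and sorts them. The column names and values are invented for illustration, and this is only the general idea, not the actual implementation of the Filter Based Feature Selection module.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target_name):
    """Sort feature columns by the strength of their correlation with the target."""
    target = columns[target_name]
    scores = {
        name: pearson(values, target)
        for name, values in columns.items()
        if name != target_name
    }
    # Strong negative correlations are as useful as strong positive ones,
    # so rank by absolute value.
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented toy data (assumption): five hypothetical counties.
county_data = {
    "life_expectancy": [78.1, 74.3, 76.5, 72.9, 79.0],
    "poverty_rate": [9.0, 21.5, 14.2, 25.1, 7.5],
    "lung_cancer_rate": [45.0, 70.2, 55.1, 80.3, 42.0],
    "population_density": [120.0, 300.0, 80.0, 95.0, 410.0],
}

ranking = rank_features(county_data, "life_expectancy")
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

Unlike the module output shown above, this sketch excludes the target column from its own ranking, since its correlation with itself is always 1.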

How Predictive is our Model?

In order to test the predictive power of our model, we need to apply some algorithms to see if we can use the columns we selected to make an accurate prediction of Average Life Expectancy.  Azure ML Studio provides a number of industry standard algorithms for such analysis.  In this case, because we are trying to predict a continuous value (i.e. it could be any number), the problem lends itself to regression algorithms, which try to determine the equation that produces a predicted Average Life Expectancy value from our set of features.  Using machine learning, the algorithms try a number of feature combinations with different weightings to find the best fit equation that aligns to the actual results from the dataset.  We can then test the accuracy of the equation using our dataset as well.

In order to create a training dataset and a testing dataset, we can split our original list of 3142 rows in half, using 50% for training the model and 50% for testing and evaluation.  In Azure ML, you can use the Split Data module to do exactly this.  We can use Linear Regression as our algorithm and feed it, along with the training data, through the training step to calibrate the algorithm using machine learning.  Once this has been done, we can then score and evaluate the model to test its predictive power.
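The split, train and score steps can be sketched in plain Python. This is a simplified sketch, not what Azure ML Studio does internally: it fits a single-feature ordinary least squares line rather than a multi-feature regression, and the data is synthetic (the coefficients 80.0 and -0.25 are made up so there is a known relationship to recover).

```python
import random


def split_rows(rows, train_fraction=0.5, seed=0):
    """Shuffle rows reproducibly and split them into training and testing lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def score(model, rows):
    """Apply a fitted (intercept, slope) model to the x value of each row."""
    intercept, slope = model
    return [intercept + slope * x for x, _ in rows]


# Synthetic (poverty_rate, life_expectancy) pairs; an assumption for illustration.
rng = random.Random(42)
rows = []
for _ in range(200):
    poverty = rng.uniform(5.0, 30.0)
    life_expectancy = 80.0 - 0.25 * poverty + rng.gauss(0.0, 0.5)
    rows.append((poverty, life_expectancy))

train_rows, test_rows = split_rows(rows, train_fraction=0.5)
model = fit_line(train_rows)            # "train" on one half
predictions = score(model, test_rows)   # "score" the held-out half
print("intercept, slope:", model)
```

Keeping the test rows out of the fitting step is what makes the later evaluation meaningful: the model is judged on rows it has not seen.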

image

When you run this model, you get the following results in the evaluate model.

image

The model is a reasonably good but not excellent predictor of Average Life Expectancy.  If you look at the Coefficient of Determination, the closer this value is to 1, the better the predictive power.  In this case, 0.60 is a reasonably good score; 0.90 or greater would be considered excellent.  The Error Histogram is also very illustrative, as it shows the error variability.  In this experiment, 48% of the results were within 0.0014, which is very good for an Average Life Expectancy of between 70 and 80 years.  Another 30% were off by less than a year.

However, there are a few outliers in the data where the algorithm was off by more than 3 years.
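For readers who want to see how these two measures are computed, here is a minimal sketch. The Coefficient of Determination is one minus the ratio of the residual sum of squares to the total sum of squares, and the histogram view is approximated by the share of rows whose absolute error falls under a given threshold. The actual and predicted values are invented; they are not the results from the experiment above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def error_shares(actual, predicted, edges):
    """Cumulative fraction of rows whose absolute error is below each edge."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return {edge: sum(e < edge for e in errors) / len(errors) for edge in edges}


# Invented values (assumption): life expectancy in years for five counties.
actual = [76.0, 74.5, 78.2, 72.0, 79.1]
predicted = [75.5, 75.0, 77.9, 75.5, 78.8]

print("R^2:", round(r_squared(actual, predicted), 3))
print("share of rows under each error threshold:",
      error_shares(actual, predicted, edges=[1.0, 3.0]))
```

In this toy data a single row that is off by more than 3 years pulls the score down noticeably, which is the same effect the outliers have in the experiment.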

Revising Our Dashboard

What does this analysis tell us?  A few important conclusions are worth noting.

The first is that the key factors that impact Average Life Expectancy seem to be:

  • Births with Mothers Under 18
  • Poverty
  • Lung Cancer
  • Low Birth Weight
  • Premature
  • Births with Mothers Under 40
  • % Black Population
  • Very Low Birth Rate
  • Infant Mortality
  • % White Population

If we’re interested in Average Life Expectancy, then having these on our dashboard would provide a good explanation.

image

In addition, we could use the predictive model to forecast Average Life Expectancy where the data is missing as long as we have these other factors.   Using Azure ML, you can turn your experiment into a web service whereby you would submit the input columns and the service will generate the predicted value based on the model.  This turns your experiment into an engine that can be harnessed to process future data as it arrives.


New Excel Add-in for Azure ML Just Released

One of the most interesting features of Azure Machine Learning is the ability to publish your pipelines to a production quality web service.  If you have developed a world class predictive algorithm you can very easily publish it to an endpoint which is now available to your users.  Your users can then provide data to be analyzed and Azure ML will output back a predicted value.

For example, imagine you have developed a pricing algorithm that predicts the optimal price for your product based on some features of your customer such as age, income, previous purchase history, location, etc.  You could create an Azure ML based pipeline that uses a training data set of your previous purchase history to find the best predictors of optimal price, and then have the algorithm predict the price.  Once trained and optimized, your algorithm’s job is to provide you with a predicted optimal price based on some input values.

Microsoft has now created a really easy way to provide input from Excel using the new Azure ML Add-In.  Using the Add-In, you can select cells in Excel and ship them to your Azure ML web service for processing.

The add-in is available in the Office Store here.


Comparing IBM Watson Analytics with Azure ML

In a previous post, I compared the new IBM Watson Analytics with Power BI as a business intelligence and visualization tool.  Watson Analytics also includes a predictive analytics tool, so let’s compare it with Microsoft’s Azure Machine Learning service (Azure ML).

Watson Analytics is a Data Discovery Tool, Azure ML is a Pseudo Development Tool

The first thing you notice when using both tools is the difference in their target audience.  Azure ML is targeted at developers, data scientists and very advanced business users who want to build their own analytics pipelines.  It is similar to SQL Server Integration Services (SSIS) or BizTalk in its user interface.  It provides the ability to chain inputs, actions and outputs together into a pipeline and to visualize the data that is being processed along the way.

In contrast, IBM Watson Analytics is trying to take all of that complexity away – you just upload your file and Watson Analytics analyzes your data and tries to provide the best pipeline for you under the covers and present the results.

Using a cleaned up data set of automobile pricing data, here is what a linear regression pipeline looks like in Azure ML.

image

This pipeline uses a linear regression algorithm and a Bayesian linear regression algorithm and compares their accuracy in predicting price from a set of existing features.

In contrast, with IBM Watson Analytics, you just upload your file and it takes care of the rest.

Azure ML is More Transparent and More Flexible

When you create a pipeline in Azure ML, you can pick and choose the algorithms that you want to run against your dataset.  If you understand the differences between a linear regression algorithm, a Bayesian linear regression algorithm and a decision forest regression, Azure ML is the tool for you.  It also provides good error measurement to compare algorithms for their ability to predict against your dataset.  For each algorithm, you can also set various configuration parameters to tweak the algorithm and hopefully improve your model’s ability to predict reliably.  You can also create specific training sets for machine learning and separate datasets for testing.
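To make the idea of comparing algorithms by their error measurements concrete, the sketch below computes the mean absolute error and root mean squared error for two sets of predictions against the same actual values. The names `model_a` and `model_b` and all of the prices are hypothetical; they do not represent output from either regression algorithm in the pipeline above.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but large misses are penalized more heavily."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def compare_models(actual, predictions_by_model):
    """Return (name, mae, rmse) tuples sorted from lowest to highest RMSE."""
    results = [
        (name,
         mean_absolute_error(actual, predicted),
         root_mean_squared_error(actual, predicted))
        for name, predicted in predictions_by_model.items()
    ]
    return sorted(results, key=lambda row: row[2])


# Invented prices and predictions (assumption), scored on the same test rows.
actual_prices = [13495, 16500, 13950, 17450]
predictions_by_model = {
    "model_a": [13000, 17000, 14500, 17000],
    "model_b": [12000, 15000, 15500, 19000],
}

results = compare_models(actual_prices, predictions_by_model)
for name, mae, rmse in results:
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f}")
```

The important design point is that both models are scored on the same held-out rows, so the difference in error reflects the algorithms and their settings rather than the data.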

In contrast, when you upload your file to IBM Watson Analytics, you are trusting IBM to pick the best algorithm for you.  The tool doesn’t show you what type of algorithm has been run or how it was configured until you start digging into the detail screens:

image

Watson Analytics Provides Guidance on What Drives Prediction

When you upload your dataset to Watson Analytics, it provides this nice visualization to show you the different features and how they influence the predictive ability of the model.

image

The tool also shows fields and how they are correlated.

image

Watson Analytics Provides Insights Into Your Data, But Doesn’t Actually Predict

After viewing these various charts and understanding my dataset, I was interested to see how IBM Watson Analytics performed against Azure ML in actually predicting the price.  However, this feature seems to be missing!

Once you see all the influencers in your dataset, there doesn’t seem to be any way to generate the predictive value.

The closest you can get seems to be a graph that shows the features as they influence the price and the average price for each combination of those features. 

In contrast, Azure ML will populate your dataset with a predicted price for each row.

image

Azure ML Allows for Exporting, IBM Watson Analytics Does Not Export

Once you have your predicted data, you’ll want to export it to either Excel, a database or some other visualization tool.  Azure ML provides many options for exporting data at any stage in the pipeline.

IBM Watson Analytics doesn’t support any exporting options at all.  The export feature is listed as “coming soon”.

image

Azure ML Supports R and Python

Azure ML supports injection of R or Python code into your pipelines for those advanced data scientists who are developing their own algorithms.  This allows for lots of interesting possibilities for transforming, scoring and evaluating data as it is moved through the pipeline.

Watson Analytics has no such feature.  As a business-centric tool, it provides no ability to customize at all.

Azure ML Provides the ability to Publish to a Web Service

Imagine you have done some in-depth analysis and built a model that has amazing predictive power.  How do you now share this or monetize it?

Azure ML provides the ability to take your experiment or machine learning model and publish it as a production ready web service.  Using a REST API, your users can then supply inputs and receive a prediction as an output.  You can even take your model and publish to the Azure Marketplace and charge for the model you have developed.
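From the caller’s side, using such a web service amounts to sending the input columns as JSON over HTTP and reading back the prediction. The sketch below only builds the request and does not send it. The URL, the API key, the column names and the payload layout are all placeholders assumed for illustration; a real published service documents its own endpoint, authentication and request schema.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a JSON POST request for a prediction service."""
    # Hypothetical payload layout; a real service defines its own schema.
    payload = {"columns": column_names, "rows": rows}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


request = build_scoring_request(
    url="https://example.invalid/score",      # placeholder endpoint
    api_key="YOUR_API_KEY",                    # placeholder credential
    column_names=["horsepower", "engine_size"],  # invented feature names
    rows=[[111, 130]],
)

# Sending the request to a real endpoint would look like this:
# with urllib.request.urlopen(request) as response:
#     prediction = json.loads(response.read())
print(request.get_method(), request.full_url)
```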


Microsoft Extends its Support for R through its purchase of Revolution Analytics

Microsoft has bought a small open source company called Revolution Analytics.  Revolution Analytics is a provider of an open source distribution of R, one of the most widely used data analysis software packages.    They also provide consulting services.

Revolution Analytics also has a product called Revolution R Cloud which is a distribution of R designed specifically for the cloud.  It is currently targeted at Amazon Web Services – presumably now it will be retargeted to Microsoft Azure.

This acquisition will help customers use advanced analytics within Microsoft data platforms on-premises, in hybrid cloud environments and on Microsoft Azure. By leveraging Revolution Analytics technology and services, we will empower enterprises, R developers and data scientists to more easily and cost effectively build applications and analytics solutions at scale.

R is an important language and toolset for data scientists working on advanced statistics algorithms.  Microsoft is investing in supporting R in its cloud offerings, including support for integration of R code and R packages in its Azure ML offering.
