Microsoft Machine Learning for Apache Spark

A Fault-Tolerant, Elastic, and RESTful Machine Learning Framework

What's new in MMLSpark v0.17:

LIME on Spark

Major Speed Improvements and Tabular Data Support

Try an Example

Spark Serving v2

Sub-millisecond Latency, Fault Tolerance, Azure ML Integration

Learn More

LightGBM on Spark Improvements

3x-4x Faster Evaluation, Early Stopping, and Regularization Support.

Learn More

Featured Project:

Generative Adversarial Art with:

Explore the mind of a GAN trained on the Metropolitan Museum of Art's collected works. Then, find your creation in the MET's collection with reverse image search.

Explore our Features:

The Cognitive Services on Spark

Leverage the Microsoft Cognitive Services at unprecedented scales in your existing SparkML pipelines.

Read the Paper

Stress Free Serving

Spark is well known for its ability to switch between batch and streaming workloads by modifying a single line. We push this concept even further and enable distributed web services with the same API as batch and streaming workloads.

Learn More
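
The "same API for batch, streaming, and serving" idea from Stress Free Serving can be illustrated without any framework. The sketch below is plain Python, not Spark's or MMLSpark's API; every function name in it is hypothetical. It only shows the design: the transformation is written once, and the three modes differ in how rows arrive and leave.

```python
import json


def add_length(rows):
    """The single piece of logic: annotate each row with its text length."""
    for row in rows:
        yield {**row, "length": len(row["text"])}


def run_batch(rows):
    """Batch mode: process a finite collection and return every result."""
    return list(add_length(rows))


def run_streaming(source):
    """Streaming mode: process rows lazily, as the source produces them."""
    return add_length(source)


def handle_request(body):
    """Serving mode: treat one JSON request body as a one-row input."""
    (result,) = add_length([json.loads(body)])
    return json.dumps(result)
```

Because `add_length` never needs to know which mode it runs in, moving from batch to streaming or to request handling changes only the wrapper, not the logic.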

Lightning Fast Gradient Boosting

MMLSpark adds GPU-enabled gradient boosted machines from the popular framework LightGBM. Users can mix and match frameworks in a single distributed environment and API.

Try an Example

Distributed Microservices

MMLSpark provides powerful and idiomatic tools to communicate with any HTTP endpoint service using Spark. Users can now use Spark as an elastic microservice orchestrator.

Learn More

Large Scale Model Interpretability

Understand any image classifier with a distributed implementation of Local Interpretable Model-Agnostic Explanations (LIME).

Try an Example
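
To make the LIME idea under Large Scale Model Interpretability concrete, here is a minimal single-machine sketch in plain Python. It is not MMLSpark's implementation or API. The names `lime_explain` and `solve_linear_system`, the distance measure, and the kernel width are all assumptions made for illustration, and a ridge-regularized weighted least-squares fit is swapped in as the local surrogate model. Each interpretable feature is an on/off switch (for an image, typically a superpixel that is kept or hidden).

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lime_explain(predict, num_features, num_samples=500,
                 kernel_width=0.75, ridge=1e-3, seed=0):
    """Return one weight per feature from a locally weighted linear fit.

    predict: black-box function taking a 0/1 mask (1 = feature kept).
    """
    rng = random.Random(seed)
    dim = num_features + 1  # column 0 is the intercept
    xtwx = [[0.0] * dim for _ in range(dim)]
    xtwy = [0.0] * dim
    for _ in range(num_samples):
        mask = [rng.randint(0, 1) for _ in range(num_features)]
        # Assumed distance: fraction of features switched off, so masks
        # closer to the unperturbed instance get larger weights.
        dist = 1.0 - sum(mask) / num_features
        weight = math.exp(-(dist ** 2) / kernel_width ** 2)
        row = [1.0] + [float(v) for v in mask]
        y = predict(mask)
        # Accumulate the weighted normal equations X^T W X and X^T W y.
        for i in range(dim):
            xtwy[i] += weight * row[i] * y
            for j in range(dim):
                xtwx[i][j] += weight * row[i] * row[j]
    for i in range(dim):
        xtwx[i][i] += ridge  # small ridge term keeps the system solvable
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]  # drop the intercept
```

For a black box that is itself linear in the mask, the returned weights should land close to its true coefficients, which is a convenient sanity check. The per-sample calls to `predict` are independent of one another, which is the part a distributed implementation can spread across a cluster.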

Scalable Deep Learning

MMLSpark integrates the distributed computing framework Apache Spark with the flexible deep learning framework CNTK, enabling deep learning at unprecedented scales.

Read the Paper


MMLSpark can be conveniently installed on existing Spark clusters via the --packages option. For example:
spark-shell --packages Azure:mmlspark:0.17
pyspark --packages Azure:mmlspark:0.17
spark-submit --packages Azure:mmlspark:0.17 MyApp.jar
This can be used in other Spark contexts too. For example, you can use MMLSpark in AZTK by adding it to the .aztk/spark-defaults.conf file.
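
For the configuration-file route, a minimal sketch follows. It assumes the file is a standard Spark properties file (one whitespace-separated key and value per line) and uses `spark.jars.packages`, the same property set in the PySpark snippet on this page. The helper name `add_mmlspark_package` is hypothetical and is not part of MMLSpark or AZTK.

```python
from pathlib import Path

MMLSPARK_COORDINATE = "Azure:mmlspark:0.17"  # coordinate from the commands above


def add_mmlspark_package(conf_path, coordinate=MMLSPARK_COORDINATE):
    """Hypothetical helper: ensure spark.jars.packages lists the coordinate.

    spark.jars.packages takes a comma-separated list of Maven coordinates.
    Only "key value" lines are handled here; "key=value" lines are left alone.
    """
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    for i, line in enumerate(lines):
        parts = line.split(None, 1)
        if parts and parts[0] == "spark.jars.packages":
            listed = parts[1].split(",") if len(parts) > 1 else []
            listed = [p.strip() for p in listed if p.strip()]
            if coordinate not in listed:
                listed.append(coordinate)
            lines[i] = "spark.jars.packages " + ",".join(listed)
            break
    else:
        # No existing entry: append a new one.
        lines.append("spark.jars.packages " + coordinate)
    path.write_text("\n".join(lines) + "\n")
    return path.read_text()
```

The helper is idempotent, so running it twice against the same file leaves a single entry for the package.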

Step 1: Create a Databricks account

If you already have a Databricks account, please skip to Step 2. If not, you can make a free account on Azure.

Step 2: Install MMLSpark

To install MMLSpark on the Databricks cloud, create a new library from Maven coordinates in your workspace. For the coordinates use: Azure:mmlspark:0.17. Next, ensure this library is attached to your cluster (or all clusters). Finally, ensure that your Spark cluster has Spark 2.3 and Scala 2.11. You can use MMLSpark in both your Scala and PySpark notebooks.

Step 3: Load our Examples (Optional)

To load our examples, right click in your workspace, click "import" and use the following URL:
The easiest way to evaluate MMLSpark is via our pre-built Docker container. To do so, run the following command:
docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
Navigate to http://localhost:8888/ in your web browser to run the sample notebooks. To read the EULA for using the docker image, run:
docker run -it -p 8888:8888 microsoft/mmlspark eula
To try out MMLSpark on a Python (or Conda) installation, first install PySpark via pip with pip install pyspark. Next, use --packages or add the package at runtime to get the Scala sources:
import pyspark
# Ask Spark to resolve the MMLSpark package when the session starts.
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "Azure:mmlspark:0.17") \
    .getOrCreate()
# Import after the session is created, once the package has been fetched.
import mmlspark
If you are building a Spark application in Scala, add the following lines to your build.sbt:
resolvers += "MMLSpark Repo" at ""
libraryDependencies += "" %% "mmlspark" % "0.17"

Unsupervised Fire Safety

Spark + AI Summit Europe Keynote 2018

We use Bing on Spark, CNTK on Spark, and Spark Serving to create an automated fire detection service for gas station safety. We then deploy this to an FPGA-accelerated camera for Shell Industries.

Watch Now

Predictive Maintenance with UAVs

Spark + AI Summit 2018

We use CNTK on Spark to distribute a Faster R-CNN object detection network and deploy it as a web service with MMLSpark Serving for use on Unmanned Aerial Vehicles (UAVs).

Watch Now

Automated Snow Leopard Detection

We have partnered with the Snow Leopard Trust to create an intelligent snow leopard identification system. This project helped eliminate thousands of hours of searching through photos.

Real-time Intelligent Analytics

Microsoft Connect Keynote 2017

We use CNTK on Spark and deep transfer learning to create a real-time geospatial application for conservation biology in 5 minutes.

Watch Now