Azure Databricks for Data Science & Engineering Teams – Part 1
The analytics arm of a large midwestern U.S. insurance firm needed to analyze the current-state architecture used by its teams of data engineers and data scientists. The company asked SPR to help by focusing on the portion of the architecture that uses Azure HDInsight, and by building a proof of concept (PoC) on the Azure Databricks Unified Analytics Platform for the company's future-state architecture.
The company had 3 key criteria for implementation of the PoC:
- Focus on use of the R language for machine learning (ML) models
- Account for current data science team development processes
- Preference for use of Azure Data Factory for data pipelines, which the current data engineering team was already looking to adopt
Using these criteria, SPR first worked with the data science team to understand current development processes and to define the PoC scope.
Background on HDInsight and Databricks
At the time, the data science team used Azure HDInsight as its primary tool for executing R language ML models with Apache Spark. Of note, Azure HDInsight is an Azure distribution of Hadoop components from the Hortonworks Data Platform (HDP), which has been one of the three most recognized Hadoop distributions alongside Cloudera Distributed Hadoop (CDH) and the MapR Data Platform (MDP).
Apache Spark is a separate component that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Because the data science team primarily used Azure HDInsight to execute ML models on Apache Spark through Apache Hadoop YARN, the current-state ecosystem extended considerably beyond Apache Spark into the broader Hadoop stack.
Another product, Databricks, was made generally available for Azure in March 2018, just 6 months prior to the inception of the SPR engagement (Databricks was made available for AWS in June 2015). Azure Databricks is an analytics platform based on Apache Spark and optimized for the Azure cloud. It provides fully managed Spark clusters, an interactive workspace for exploration and visualization, and a platform for powering Spark-based applications.
The Azure implementation of Databricks is so tightly integrated with Azure that the service is a first-party Microsoft offering, provisioned in the same way as other Azure-branded services such as Azure HDInsight, even though the Databricks product itself comes from a third party (albeit one designed in collaboration with Microsoft).
While Azure HDInsight as a product offering is not expected to go away any time soon, Microsoft claims its Azure Databricks offering provides the industry’s fastest Spark implementation: 5 times faster than Apache Spark (which its creators already claim runs workloads up to 100 times faster than Apache Hadoop).
Databricks also comes packaged with a built-in Hive Metastore, providing backward compatibility via Spark SQL with a common abstraction layer for the data currently used with HDInsight. Databricks was created by the founders of Spark, who continue to innovate; recent examples include a collaboration with RStudio that brings an option to integrate RStudio Server with Databricks, and R support for the open source MLflow product.
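To illustrate that backward compatibility, here is a minimal sketch of querying an existing metastore table from a Databricks R notebook via SparkR and Spark SQL; the table and column names are hypothetical placeholders, not the client's actual schema:

```r
# Minimal sketch (Databricks R notebook): query an existing Hive Metastore table
# via Spark SQL. The "claims" table and its columns are hypothetical; substitute
# a table already registered in the metastore by existing HDInsight workloads.
library(SparkR)
sparkR.session()  # attaches to the session Databricks already provides

claims_df <- sql("SELECT policy_id, claim_amount FROM claims WHERE claim_year = 2018")
head(claims_df)   # pull the first rows back to the driver as an R data.frame
```

Because the same metastore abstraction sits in front of the data, queries written this way look the same whether the underlying tables were originally defined for HDInsight or for Databricks.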
Since it was the latest version available when the project started, SPR largely focused on Azure Databricks 4.3, which uses Apache Spark 2.3.1. Databricks 5.0, which uses Apache Spark 2.4.0, shipped just a week after the new Spark version itself in November 2018, which makes sense knowing that Databricks comes from the creators of Spark.
Contrast this with Azure HDInsight 4.0, which is in preview mode and still makes use of Spark 2.3.1, first released in early June 2018. While these releases are only a few months apart, Databricks 5.0 together with Spark 2.4.0 resolves over 1,000 outstanding tickets (not to mention new SparkR features applicable to this data science team), so earlier release dates are not just a matter of semantics: there are real benefits to getting access to releases sooner.
Assessing Available Tooling
With things moving so quickly in Databricks and the broader Azure ecosystem, SPR assessed what was available – both products ready for production and those still in the works and not yet mature enough for adoption. While it is important to discuss the tooling we used, it is also important to understand what we didn't choose and why. Let's look at some examples:
Azure Machine Learning service – SPR determined that Azure Machine Learning service, the third incarnation of this Microsoft offering, was not ready for use (it was not made available until the week of SPR's final deliverables).
MLflow – We also chose not to use an open source framework from Databricks called MLflow, which had added R language support just the week prior to the project start. As the data science team had only recently migrated away from SAS, it was especially important to assess the level of available R support for needed Databricks features, at least for an initial migration. That said, we recommended that the company keep a close watch on this framework, as its purpose is to close the gap that currently exists in standardizing the end-to-end ML life cycle.
Other tooling we chose not to use included additional Microsoft products we assessed because the data engineering and data science teams already make use of Azure:
Microsoft Visual Studio 2015 and 2017 – We assessed Microsoft Visual Studio 2015 and 2017 specifically with respect to R- and Azure-related tooling. While a "Microsoft Azure Data Factory Tools for Visual Studio" extension is available for Visual Studio 2015, it is not available for Visual Studio 2017, and there are apparently no plans to incorporate comparable tooling into Visual Studio 2019 or later releases of Visual Studio 2017 (although the Visual Studio roadmap hints at some possibly relevant features for 1Q2019).
Microsoft R Tools for Visual Studio, Microsoft R Client, Microsoft Machine Learning Server – These products are interrelated: Microsoft R Client is used to connect remotely to Microsoft Machine Learning Server, and that connection can be made from within Visual Studio 2017. We successfully demonstrated setting up a remote connection from Visual Studio 2017 to a Microsoft Machine Learning Server instance in Azure after working through largely undocumented configuration steps. However, we did not recommend this tooling because it does not appear to be well positioned.
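For reference, the remote connection itself boils down to a few calls to the mrsdeploy package that ships with Microsoft R Client. The sketch below assumes a hypothetical Machine Learning Server endpoint on the default operationalization port (12800) and uses placeholder credentials, not the client's actual configuration:

```r
# Minimal sketch: connect from Microsoft R Client to a remote Machine Learning Server
# instance with the mrsdeploy package. Endpoint URL and credentials are placeholders.
library(mrsdeploy)

remoteLogin(
  "https://my-mlserver.example.com:12800",  # operationalization endpoint (placeholder)
  username    = "admin",
  password    = "<password>",
  session     = TRUE,   # open an interactive remote R session
  diff        = TRUE,   # report differences between local and remote environments
  commandline = TRUE    # switch the console prompt to the remote session
)

# ...run R code in the remote session, then disconnect
remoteLogout()
```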
Databricks Proof of Concept (PoC)
Once we determined what we would not be using for the PoC, we moved forward with the Azure components we would be using. In addition to Azure Databricks, we chose Azure Blob Storage, Azure Data Factory, and Azure DevOps, alongside desktop components such as the Databricks CLI, PowerShell, RStudio Desktop, and Git.
We also installed RStudio Server on the driver node of the Databricks cluster. Each of these components plays a part in two sets of workflows we worked through with the data engineering and data science teams. This not only helped them understand real-world application of the tooling for their needs, but also demonstrated what actually works (some available documentation is either incomplete or inaccurate) and how to perform the work in a recommended manner.
The first workflow we shared was one we deemed “DevOps.” It comprised 6 key tasks:
- creating and terminating a Databricks cluster
- using a Databricks cluster initialization (“init”) script
- installing RStudio Server on a Databricks cluster
- integrating with Azure DevOps
- creating and installing an R package for a model (see the sketch after this list)
- testing and measuring performance
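As an illustration of the package task above, the following is a minimal sketch of building and installing an R package for a model from RStudio (Desktop, or Server on the driver node). The package directory name claimsmodel is hypothetical and stands in for the team's actual model code:

```r
# Minimal sketch: build an R package that wraps a model, then install and load it.
# "claimsmodel" is a hypothetical package source directory, not the client's code.
library(devtools)

# Build a source tarball from the package directory
pkg_tarball <- devtools::build("claimsmodel")

# Install the built package into the current library path
install.packages(pkg_tarball, repos = NULL, type = "source")

# Load the packaged model code so notebooks and scripts can call it
library(claimsmodel)
```

Packaging the model this way keeps the code versionable in Azure DevOps and installable onto any Databricks cluster in a repeatable step, rather than living only inside notebooks.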
We developed custom scripts that enable repeatable, consistent execution of these DevOps tasks. Databricks support for RStudio Server integration, available as of June 2018, was a highlight, as it brings a look, feel, and functionality similar to the RStudio Desktop product used by the data science team. Plus, it includes the ability to execute code apart from Databricks notebooks, something the data science team requested.
Additionally, the data engineering and data science teams wanted to integrate directly with source control. We demonstrated several options for this purpose, including our own recommendation to use Azure DevOps, which the data engineering and data science teams have since adopted.
We called the second workflow we shared “data pipeline”. It also had 6 key tasks:
- creating a mount point to Azure Blob Storage
- accessing, reading, and preparing input data (see the sketch after this list)
- writing output data
- manually executing notebooks and models
- using Azure Data Factory to automate execution
- testing and measuring performance
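As an illustration of the read/prepare/write tasks above, here is a minimal SparkR sketch. It assumes a DBFS mount point to Azure Blob Storage (a hypothetical /mnt/poc-data) has already been created, and the file, path, and column names are placeholders:

```r
# Minimal sketch: read raw data from a Blob Storage mount, prepare it, and write it back.
# The mount point, file names, and columns are placeholders, not the client's data.
library(SparkR)
sparkR.session()

# Read raw input data from the mounted Blob Storage container
raw_df <- read.df("/mnt/poc-data/input/policies.csv",
                  source = "csv", header = "true", inferSchema = "true")

# Prepare the data: keep only the rows and columns the model needs
prepared_df <- select(
  filter(raw_df, raw_df$claim_amount > 0),
  "policy_id", "claim_amount", "claim_year"
)

# Write the prepared data back to the mount as Parquet for downstream model runs
write.df(prepared_df, path = "/mnt/poc-data/prepared/policies",
         source = "parquet", mode = "overwrite")
```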
As with the DevOps tasks, the data pipeline tasks were scripted as much as possible. The data engineering team was already moving in the direction of Azure Data Factory, so we used this product in the PoC to execute code in a batch manner. And since the data science team already uses Azure Blob Storage with HDInsight, pairing Blob Storage here with Hive-compatible Spark SQL showed how the firm could migrate from HDInsight to Databricks in a future project phase.
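To illustrate that compatibility, the following sketch registers the Parquet output from the previous sketch as an external table in the built-in Hive Metastore via Spark SQL; the table name and path are again placeholders:

```r
# Minimal sketch: expose Parquet output on the Blob Storage mount as an external
# table in the Hive Metastore, so it can be queried with the same Hive-compatible
# SQL used on HDInsight. Table name and location are placeholders.
library(SparkR)
sparkR.session()

sql("CREATE TABLE IF NOT EXISTS prepared_policies
     USING PARQUET
     LOCATION '/mnt/poc-data/prepared/policies'")

# Existing Hive-style queries continue to work against the registered table
head(sql("SELECT claim_year, COUNT(*) AS n FROM prepared_policies GROUP BY claim_year"))
```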
The Result
After we presented our findings and delivered the scripts we developed, the firm decided to move forward with a recommended Pilot phase. The purpose of the Pilot phase is to migrate to Databricks a subset of the ML models the firm currently runs in production on HDInsight. Testing and measuring performance are also planned as part of this migration.
Technologies used during this effort included Azure Databricks, Azure Blob Storage, Azure Data Factory, RStudio Server, Azure DevOps, R, Python, Databricks CLI, PowerShell, RStudio Desktop, and Git.