Below is a brief survey of factors that led our client to move from skepticism to adoption.
Databricks just works
At the outset, the client stated that they had already tried Databricks and it hadn't worked out. Databricks had only recently become generally available in Azure, so the question was: to what extent had they worked with the product? It turned out that the staff lacked the needed skill sets and didn't have the bandwidth to pick them up. Our team was brought on board to address this roadblock.
The first client directive for the pilot phase was to migrate a subset of ML models with no changes to the original code, primarily to avoid the downstream effects of altering the code base. The only programmatic changes we ended up needing to make addressed data quality issues, some of which stemmed from migrating from Apache Hive to Spark SQL as part of the process.
We rigorously compared model output from HDInsight and Databricks for full production datasets, and found that only one output value of dozens differed, affecting the last three models we migrated. In reviewing these differences with the client, we determined that they were due to changes in the Spark function "percentile_approx" over time, since HDInsight and Databricks ran different versions of Spark.
Note that we used a relative tolerance of 1e-9 (0.000000001) when comparing model output; when the tolerance was relaxed, the differences in this sole column disappeared. Regardless, the client considered the minor deviation immaterial.
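The comparison logic can be sketched as follows. This is a minimal illustration using Python's standard library, not the client's actual comparison harness, and the numeric values are hypothetical stand-ins for the percentile_approx drift we observed.

```python
import math

def outputs_match(hdinsight_value, databricks_value, rel_tol=1e-9):
    """Compare one model output value from each platform within a
    relative tolerance (default 1e-9, as used in the pilot)."""
    return math.isclose(hdinsight_value, databricks_value, rel_tol=rel_tol)

# Hypothetical values: they fail at the strict tolerance
# but pass once the tolerance is relaxed.
strict = outputs_match(100.0, 100.0000005)               # rel_tol=1e-9
relaxed = outputs_match(100.0, 100.0000005, rel_tol=1e-6)
print(strict, relaxed)  # → False True
```

In practice such a check would run per column across the full production dataset, flagging only columns whose values fall outside the tolerance.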
No changes needed to library packaging
This second part of the case study covers 14 specific technical implementation issues we encountered and surmounted during the pilot phase. Keep in mind that we had already become familiar with Databricks while building a PoC, during which we also addressed specific client obstacles from their initial trial runs in prior months.
One of the remaining stumbling blocks for the client team had been installing the proprietary R packages they had developed, as opposed to R packages publicly available via CRAN (the Comprehensive R Archive Network). After demonstrating how to do so via notebooks, we also advised moving installation of all needed R packages into Databricks cluster initialization ("init") scripts, so that the packages would be available on every cluster node executing models.
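The shape of such an init script can be sketched by generating it in Python. This is only an illustration of the approach: the package names and archive paths below are hypothetical, and a real script would be uploaded to cluster storage and registered as a cluster init script.

```python
def build_init_script(cran_packages, proprietary_archives):
    """Build a bash init script that installs CRAN packages plus
    proprietary R packages from source archives (hypothetical paths)."""
    lines = ["#!/bin/bash", "set -e"]
    for pkg in cran_packages:
        # Install a public package from CRAN on each cluster node.
        lines.append(
            f"Rscript -e 'install.packages(\"{pkg}\", repos=\"https://cran.r-project.org\")'"
        )
    for archive in proprietary_archives:
        # Install a proprietary package from a local source archive.
        lines.append(
            f"Rscript -e 'install.packages(\"{archive}\", repos=NULL, type=\"source\")'"
        )
    return "\n".join(lines)

script = build_init_script(
    ["data.table"],                                  # hypothetical CRAN package
    ["/dbfs/packages/clientmodels_1.0.tar.gz"],      # hypothetical proprietary archive
)
print(script)
```

Because the script runs on every node at cluster startup, both public and proprietary packages are present before any model code executes.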
Integration with chosen Git implementation
Databricks did not initially offer out-of-the-box support for integration with Azure DevOps, a product which itself had only recently become generally available. But this lack of support was really a limitation of the notebook user interface, which offered minimal explicit support for Git-compatible products.
Since we did not want any unneeded user interface dependencies, we first showed the client team how to programmatically integrate with Azure DevOps, and later also implemented automated syncing of notebooks to Git. This minimized the need for developers to sync manually while programming, and stored all code in the same centralized location already used by client teams.
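One building block of such automation is exporting a notebook's source programmatically so it can be committed to the Git repository. A minimal sketch, assuming the Databricks Workspace API's export endpoint; the workspace URL, token, and notebook path below are hypothetical placeholders, and the actual sync implementation is not shown.

```python
from urllib.parse import urlencode

def export_request(workspace_url, token, notebook_path):
    """Build the REST call that exports one notebook's source code
    from the Databricks workspace for committing to Git."""
    query = urlencode({"path": notebook_path, "format": "SOURCE"})
    url = f"{workspace_url}/api/2.0/workspace/export?{query}"
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers

url, headers = export_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",  # hypothetical workspace
    "<personal-access-token>",                             # hypothetical token
    "/Shared/models/model_01",                             # hypothetical notebook path
)
print(url)
```

A scheduled job can walk the workspace, issue one such request per notebook, and commit the exported source to the Azure DevOps repository.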
Notebook alternative exists
The client made it clear that modelers on their team used a variety of development tooling, and also that they were highly averse to using notebooks. While notebooks have quickly become popular for data science in recent years, the modelers were accustomed to programming via the command line and an IDE called RStudio Desktop, and did not like the idea of being constrained to notebooks for development, as they felt some scenarios (such as debugging) were challenging there.
Our solution used RStudio Server, whose installation we included in our cluster initialization ("init") scripts, to offer a browser-based version of RStudio Desktop. And since the Azure implementation of Databricks (unlike the AWS implementation) does not permit SSH access to clusters, RStudio Server additionally provided command line access similar to RStudio Desktop.
And since the notebooks we developed intermingled Python to execute pipeline code, R to execute models, and SQL for exploratory work, by the end of the pilot phase client teams had begun to see this as an advantageous feature.
Hadoop goes away
To reiterate, this content should be considered only an afternote: several challenges were either reported by client teams or experienced by our team during the migration from HDInsight to Databricks. Because we needed to compare output from models executed in HDInsight and Databricks, we had to execute first on HDInsight, which meant using Apache Hadoop YARN for resource management, job scheduling, and monitoring.
The HDInsight cluster was launched at the beginning of the day and taken down at the end of the day, and Hadoop processes were sometimes unavailable, preventing us from resuming our testing. Even when this issue did not surface, we still had to contend with configuration, where improper settings resulted in obscure error messages.
After migrating the first model, the client was happy they no longer needed to fiddle with YARN. Databricks is not built on Hadoop and does not carry Hadoop's legacy constraints. And since it had taken hours each day to bring up the HDInsight cluster used by client teams, being able to bring up Databricks clusters on an as-needed basis was seen as a significant benefit.
Since the client's focus was on the Apache Spark component of Hadoop, we weren't encumbered by unnecessary legacy constraints, and we also benefited from the much wider selection of available Databricks cluster node configurations.
To learn more about Azure Databricks or other options within the Azure ecosystem or to discuss how your business could benefit from machine learning models, please contact us.