Considerations for Choosing a Data Science Platform
By Ray Johnson
With the proliferation of data science platforms, how do you pick the one that’s right for you? When choosing a data science platform, consider the following characteristics. They are listed in no particular order: priorities will depend on the specific business outcome being addressed, and finding the best overall fit may mean trading one characteristic off against another.
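One way to make such trade-offs explicit is a weighted scoring matrix. The following is a minimal sketch, not a prescribed method: the platform names, the weights, and the 1-to-5 scores are all hypothetical placeholders that you would replace with your own assessment.

```python
# Weighted scoring sketch. All weights and scores below are hypothetical.

def weighted_score(scores, weights):
    """Return the weighted average of per-characteristic scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # A characteristic missing from `scores` counts as 0.
    return sum(weights[name] * scores.get(name, 0) for name in weights) / total_weight


def rank_platforms(candidates, weights):
    """Return (platform, score) pairs sorted from best to worst fit."""
    scored = [(name, weighted_score(scores, weights))
              for name, scores in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Higher weight = more important to the business outcome at hand.
weights = {"coverage": 3, "scalability": 2, "productivity": 3, "cost": 1}

# Scores on a 1 (poor) to 5 (excellent) scale.
candidates = {
    "platform_a": {"coverage": 4, "scalability": 5, "productivity": 2, "cost": 3},
    "platform_b": {"coverage": 3, "scalability": 3, "productivity": 5, "cost": 4},
}

for name, score in rank_platforms(candidates, weights):
    print(f"{name}: {score:.2f}")
```

Changing the weights to reflect a different business outcome can change the ranking, which is the point of the exercise.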
Coverage refers to the core capabilities of the data science platform. Does the platform support all the key ML/statistical algorithms you require, along with their variants? If so, are the implementation details fully documented? Some level of vetting is needed to understand how the algorithms were implemented and which test cases were used to validate their output. This is particularly important when the platform provides multiple implementations of the same algorithm.
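As an illustration of that kind of vetting, the sketch below checks two implementations of the same statistic against a trusted reference. It is only an example under stated assumptions: the two variance functions stand in for whatever implementations a platform offers, and Python's standard `statistics.variance` is used as the reference.

```python
import math
import random
import statistics


def variance_two_pass(values):
    """Sample variance: compute the mean first, then the squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_welford(values):
    """Sample variance using Welford's single-pass update."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / (count - 1)


def implementations_agree(values, rel_tol=1e-9):
    """Check every implementation against the reference within a tolerance."""
    reference = statistics.variance(values)
    return all(
        math.isclose(impl(values), reference, rel_tol=rel_tol)
        for impl in (variance_two_pass, variance_welford)
    )


# Seeded synthetic data so the check is repeatable.
rng = random.Random(42)
sample = [rng.gauss(100.0, 15.0) for _ in range(10_000)]
print(implementations_agree(sample))
```

The same pattern applies to more complex algorithms: fix the input, pick a reference you trust, and decide up front what tolerance counts as agreement.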
Scalability & Performance
Scalability and performance encompass how well the platform’s architecture uses its available resources.
In the case of code-based platforms and APIs, does the platform run native code or does it use JIT (just-in-time) compilation? For small data volumes (small is relative), this may not matter; for larger datasets with millions of examples, however, it can become an issue. Benchmarks are important but are not a substitute for observing performance on real-world workloads.
Some platforms support a client-server model. This is important because development can be performed on a local client while execution takes place on a remote server with greater computing capability. This directly affects the platform’s ability to scale up and increase overall performance.
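A simple way to observe how a workload behaves as data grows is to time the same operation at several input sizes. The sketch below is a rough illustration only: the `workload` function (sorting a list) and the chosen sizes are hypothetical stand-ins for a representative task on your own data.

```python
import random
import time


def best_time(fn, data, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    return min(timings)


def workload(values):
    """Hypothetical stand-in for a real task; replace with your own."""
    return sorted(values)


def measure_scaling(sizes, seed=0):
    """Time the workload at each input size and return {size: seconds}."""
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        data = [rng.random() for _ in range(size)]
        results[size] = best_time(workload, data)
    return results


for size, seconds in measure_scaling([1_000, 10_000, 100_000]).items():
    print(f"{size:>8} rows: {seconds:.6f} s")
```

Taking the minimum over several runs reduces noise from other processes, but it still measures a synthetic task; timings from actual production workloads remain the better guide.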
GPUs & Parallel Processing
The ability to access GPU capabilities can provide substantial boosts in performance. Compute-intensive functions can be offloaded to the GPU’s massively parallel architecture, while other tasks are handled by the CPU. To benefit from this, the data science platform’s underlying libraries must be optimized to detect and use the GPU. Even without a GPU, the platform should be capable of leveraging multiple CPUs and cores.
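To illustrate the multi-core case, here is a minimal sketch that splits a computation into chunks and runs them on separate processes with Python's standard `concurrent.futures` module. The sum-of-squares task and the helper names are assumptions chosen for illustration; a real platform would typically handle this partitioning internally.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """The unit of work executed by each worker process."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=None):
    """Distribute the chunks across worker processes and combine the results."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(values, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))


if __name__ == "__main__":
    data = list(range(1_000_000))
    # The parallel result should match the single-process result.
    print(parallel_sum_of_squares(data) == partial_sum_of_squares(data))
```

For a task this small, the cost of starting processes and copying data may outweigh the gain; the pattern pays off when each chunk carries substantial computation.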
There are several components to productivity:
- Amount of training required to use the platform
- Effort required to develop and deploy a solution
- Availability of collaboration and configuration management tools
- Platform complexity
- Skill level of resources
The first four are directly related to the data science platform. The fifth may matter more than the other four combined. Some platforms are very code oriented, requiring a fair amount of programming skill. Despite the hype around the ease of development with certain languages, not everyone has the aptitude to write code. In fact, a significant amount of time may be spent resolving coding issues rather than solving the problem at hand.
When evaluating productivity, whether a data science platform is code oriented or GUI based should carry extra weight if the team’s programming skills are limited.
Some data science platforms are completely cloud based (PaaS – Platform as a Service), while others run on-premises and access cloud resources, specifically big data resources. Native integration with the Hadoop ecosystem is becoming an increasingly important capability given the volumes of data being stored in the cloud.
Cloud capabilities enhance the platform’s ability to access vast amounts of data and, in the case of PaaS, provide the capability to scale compute resources.
The interface, or lack thereof, can make or break a data science platform depending on your specific requirements. The platform’s interface can be thought of in terms of its API and its GUI. The interface has a direct impact on the overall productivity of the platform.
From the API perspective, an open, well-documented interface makes it easier to integrate platform functionality into custom applications. Some IDEs have plugins that provide direct access to the platform’s API (code completion, syntax hints, etc.).
From a GUI perspective, a good GUI can increase overall efficiency and productivity by providing a way to visualize machine learning/analytic processes, select algorithms, shape data, and define parameters.
Community & Support
Community is very important, especially when leveraging open source data science platforms. A vibrant community provides a level of confidence that the platform will evolve and be enhanced. As new algorithms, techniques, and approaches are developed, they will be picked up by the community and incorporated into new platform functionality.
The community and its associated forums will also be where support can be obtained for issue resolution or where questions can be asked.
Many data science vendors wrap open source solutions within their own frameworks. This should not be discounted. Data science vendors can provide support above and beyond what may be found via the open source community. It may be more palatable to have a single point of contact with paid support versus searching a forum for a support solution.
How easy is it to deploy the final solution built on the data science platform? What good is the solution if it is difficult or impossible to place into production? For example:
- Does the platform allow embedding into custom desktop and web applications?
- What capabilities exist for parameterization?
- Are web services supported?
- Is there an open API available?
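To make the parameterization and web service questions concrete, the sketch below wraps a scoring function in a small JSON-over-HTTP service using only Python's standard library. Everything about it is an assumption for illustration: `score` is a hypothetical stand-in for a trained model, the `threshold` parameter and the request fields are invented, and a production deployment would need authentication, logging, and a proper application server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(features, threshold=0.5):
    """Hypothetical stand-in for a model: average the features, apply a threshold."""
    value = sum(features) / len(features)
    return {"score": value, "label": int(value >= threshold)}


def handle_request(body):
    """Turn a JSON request body into (HTTP status, JSON response bytes)."""
    try:
        payload = json.loads(body)
        # `threshold` is a caller-supplied parameter with a default value.
        result = score(payload["features"], payload.get("threshold", 0.5))
        return 200, json.dumps(result).encode("utf-8")
    except (ValueError, KeyError, TypeError, ZeroDivisionError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")


class ScoringHandler(BaseHTTPRequestHandler):
    """Exposes handle_request as an HTTP POST endpoint."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(host="127.0.0.1", port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer((host, port), ScoringHandler).serve_forever()


# The same function can be embedded directly in an application, without HTTP.
status, body = handle_request(b'{"features": [0.2, 0.8, 0.9], "threshold": 0.6}')
print(status, body.decode("utf-8"))
```

Keeping the request handling in a plain function, separate from the HTTP layer, means the same logic can be embedded in a desktop application or called from a web service. Calling `serve()` starts the HTTP endpoint.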
Does the data science platform support security? Does user level security exist? The need for security will be driven by data, compliance and audit requirements. The ability to define security by role may be necessary. This may also include row level security and the ability to anonymize specific attributes.
Information security and business outcomes must be balanced so that they do not conflict.
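The sketch below shows, under stated assumptions, how role-based rules, row-level filtering, and attribute masking might fit together. The roles, field names, and key are hypothetical, and keyed hashing as used here is pseudonymization rather than full anonymization; whether it satisfies a given compliance requirement has to be assessed separately.

```python
import hashlib
import hmac

# Hypothetical key for illustration; real keys belong in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Replace a value with a keyed SHA-256 digest (same input, same output)."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical policies: which regions a role may see, which fields are masked.
ROLE_POLICIES = {
    "analyst": {"regions": {"EU"}, "masked_fields": {"customer_name", "email"}},
    "auditor": {"regions": {"EU", "US"}, "masked_fields": set()},
}


def apply_policy(rows, role):
    """Return only the rows and field values the given role is allowed to see."""
    policy = ROLE_POLICIES[role]
    visible = []
    for row in rows:
        if row["region"] not in policy["regions"]:
            continue  # row-level security: skip rows outside the role's regions
        visible.append({
            field: pseudonymize(value) if field in policy["masked_fields"] else value
            for field, value in row.items()
        })
    return visible


rows = [
    {"customer_name": "Alice", "email": "alice@example.com", "region": "EU", "spend": 120},
    {"customer_name": "Bob", "email": "bob@example.com", "region": "US", "spend": 80},
]

for role in ("analyst", "auditor"):
    print(role, apply_policy(rows, role))
```

Because the digest is deterministic for a given key, masked values can still be used to join or group records without exposing the underlying attribute.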
There is always a cost associated with any data science platform, whether it is open source or proprietary. There may be hidden costs associated with hardware acquisition, training, and licensing fees. Cloud-based platforms can also carry costs where there is a requirement to pay for compute and storage. Another question is whether there is a desire to pay for value-added support from a vendor. Overall cost can be reduced by leveraging an open source platform or using the community editions of vendor products, but paid support should not be overlooked.
The above list of considerations for choosing a data science platform is not exhaustive. It is a starting point on the journey of selecting a data science platform. I am interested in hearing about considerations that others feel may have been overlooked.