Why Visual Data Science Platforms are Here to Stay
By Ray Johnson
As data science continues to democratize, the market has responded: there are now visual data science platforms for designing and implementing the analytic process flow and advanced analytic solutions. And they're not going to disappear any time soon. This post takes a look at the background of these platforms and walks you through a few scenarios.
Traditionally, developing advanced analytic solutions has required writing, deploying and maintaining complex code, with the details dependent upon the capabilities of the chosen analytic framework.
These platforms exist in order to:
- Reduce the complexity of developing, deploying and maintaining predictive analytic solutions
- Provide solution consistency that is not solely dependent on the coding skills of the developer or their preference for a particular programming language
- Provide access to a much broader audience of Data Science practitioners
- Increase productivity of the Data Science team
As a brief refresher, the analytic process flow can be summarized as follows (CRISP-DM):
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Additional considerations for implementing this process in code include the level of control required, time to market, the effort to maintain the code base and the skill level of the Data Scientist/Data Engineer.
Some of the recently introduced tools have built-in support for big data, data transformation and blending, machine learning, exploratory data analysis, reporting, scripting language integration (R, Python) and source control. This provides an all-in-one environment for solution development and deployment. These tools also tend to take a visual workflow approach to solution development.
A lot of content has been produced about leveraging visual analytics. The purpose of this post is to illustrate how workflow-oriented data science platforms can enhance the development of advanced analytic solutions versus a code-oriented approach, in which the solution is developed from scratch using a typical scripting or low-level language. Each step of the analytics life cycle will be implemented using both a code-oriented and a visual workflow-oriented approach for comparison. The well-known Iris dataset will be used to illustrate the differences between the approaches in performing a simple clustering exercise.
Code-Oriented Solution Development
Let's now walk through solution development with a code-oriented approach, using R 3.4.1 and RStudio 1.0.143. For this illustration, all of the required code will be presented along with samples of generated output. The point here is to focus on the code-oriented approach to solution development. Keep in mind that this is a simple exercise for illustration purposes only; there are multiple ways to implement the solution, and what is presented here is not a reference implementation.
Following is a summary of the outcome we are trying to achieve:
To find populations of Irises that are similar in features and to understand the relationship of their similarity to their observed species.
The Iris data is loaded from a text file containing comma-separated values (CSV).
Following is a listing of the first few rows of the data set.
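A minimal sketch of this step in R. The source file path here is hypothetical; to keep the example self-contained, R's built-in `iris` dataset stands in for the actual source file.

```r
# To keep this sketch self-contained, write the built-in iris dataset
# out as a CSV and read it back; in practice, point read.csv at the
# actual source file.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Load the Iris data from a comma-separated text file
iris_data <- read.csv(csv_path, header = TRUE)

# List the first few rows to verify the load
head(iris_data)
```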
Various summaries and visualizations will be used to look for insights, anomalies, etc. It may also be necessary to transform the data, so let's take care of that as well. There are other techniques that can be used for exploratory data analysis (EDA), depending upon specific requirements. Following is sample code used to generate a statistical summary and exploratory visualizations for the Iris data.
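A sketch of the EDA step, assuming the Iris data has already been loaded into a data frame (the built-in `iris` dataset stands in here):

```r
# Stand-in for the data frame loaded in the previous step
iris_data <- iris

# Statistical summary of each column: min, max, quartiles, mean
summary(iris_data)

# Pairwise scatter plots of the four measurements, colored by
# species, to look for structure and anomalies
pairs(iris_data[, 1:4], col = as.integer(iris_data$Species))

# Box plot comparing a feature's distribution across species
boxplot(Sepal.Length ~ Species, data = iris_data)
```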
The classification column is removed and the data is scaled.
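A sketch of the preparation step, again using the built-in `iris` dataset as a stand-in for the loaded data:

```r
# Stand-in for the loaded data frame
iris_data <- iris

# Drop the Species classification column, keeping only the four
# numeric measurements
iris_features <- iris_data[, -5]

# Standardize each feature to zero mean and unit variance so that no
# single measurement dominates the distance calculation
iris_scaled <- scale(iris_features)
```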
A simple k-means clustering is performed, and the associated cluster identifier is appended to the original Iris data. The number of clusters was determined through visual analysis.
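A sketch of the clustering step; k = 3 reflects the visual analysis, and the seed value is arbitrary, chosen only for reproducibility:

```r
# Stand-in for the prepared, scaled feature matrix
iris_scaled <- scale(iris[, -5])

# k-means with 3 clusters; nstart runs multiple random starts and
# keeps the best result, and the fixed seed makes it reproducible
set.seed(42)
fit <- kmeans(iris_scaled, centers = 3, nstart = 25)

# Append the derived cluster identifier to the original Iris data
iris_clustered <- cbind(iris, Cluster = fit$cluster)
```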
The result of the k-means clustering is presented as a scatter plot for review. Now the derived clusters can be compared to the observed species.
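A sketch of the review step, re-running the clustering so the example stands alone; the derived clusters are plotted against the petal measurements and cross-tabulated with the observed species:

```r
# Recreate the clustering from the previous step
iris_scaled <- scale(iris[, -5])
set.seed(42)
fit <- kmeans(iris_scaled, centers = 3, nstart = 25)

# Scatter plot of the petal measurements: color shows the derived
# cluster, plotting symbol shows the observed species
plot(iris$Petal.Length, iris$Petal.Width,
     col = fit$cluster, pch = as.integer(iris$Species),
     xlab = "Petal Length", ylab = "Petal Width")
legend("topleft", legend = levels(iris$Species), pch = 1:3)

# Cross-tabulate derived clusters against observed species
table(Cluster = fit$cluster, Species = iris$Species)
```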
For the purposes of this exercise, deployment will not be addressed in detail as the focus is on the code-oriented and visual approaches for solution development. However, the code may need to be modified based on the deployment strategy.
From a code-oriented perspective, the solution is very straightforward; however, there may be variability based upon the analytic framework used, the skill of the individual developing the code, or even personal preferences. Code maintenance also needs to be addressed, as all analytic solutions evolve over time. The code-oriented approach allows detailed control of all aspects of the solution. The question, as it pertains to solution development, becomes: is there a point of diminishing returns? Should an abstraction layer be introduced, trading some control for a potential increase in productivity and clarity? After exploring the visual approach to building the same solution, the answers to these questions should become clearer. So now, how can a visual analytics platform be used to implement this same solution?
Visual Analytic Solution Development
KNIME (Konstanz Information Miner) 3.4.0 will be used to demonstrate the visual approach to building the analytics solution. KNIME is an open source analytics and reporting platform that is representative of the direction of visual analytic platforms incorporating workflow.
The following figure shows the clustering exercise that was implemented using R. The steps of the analytics process have been highlighted on the tool's canvas.
It should be pointed out that no code was written to develop the solution, which is the case with most tools that use a visual metaphor. There is, however, the capability to leverage scripting and external platform functionality (e.g., R, Python). Context menus generally provide access to node output. In this case, the Scatter Plot, Box Plot and Statistics nodes generated the following content.
From the Modeling and Evaluation steps, the results of the clustering exercise can be reviewed.
Comparing the two implementations, the code-oriented approach developed using R provides a very fine level of control and access to all of the capabilities of the language.
For this simple exercise, the choice of which specific packages to use is straightforward. The development language used may provide multiple functions and approaches for implementing the solution, including building the model from scratch versus using pre-built packages. It is up to the development team to determine what is best for achieving the identified business outcomes.
From a visual approach perspective, the choices are initially limited to the platform’s nodes. However, low level capabilities may be accessible via scripting components.
Also, it is somewhat easier to follow the visual approach and understand the overall flow of the solution. Visual analytic platforms address the issue that some individuals just don’t enjoy writing code.
So, how should these visual platforms be incorporated into your advanced analytics solution development? Consider the following:
- There is a level of extensibility offered in most of these platforms; therefore, additional custom capabilities can be added.
- A defined process flow can be visually implemented. This helps add a level of discipline in the overall development of the solution and can help validate the solution approach. Additional annotations can clarify what is being accomplished at each step.
- Depending upon the selected platform, most ML (machine learning) methods are supported. This also includes connectors to the more common databases, files and other sources (streaming, big data platforms, IoT).
- It is fairly easy to modify the solution. The process flow can be readily updated. Also, nodes can be added to display additional visualizations and leverage multiple ML methods.
- The level of productivity may increase due to the focus on the solution approach and process.
- Some individuals don’t like to write code. Visual tools offer a viable alternative.
- Solution deployment can be simplified as some tools have server components to handle configuration management and scalability.
- Base functionality is limited to the nodes that are provided by the vendor and the supporting community.
- Custom models may not be easily implemented.
- A code-oriented approach provides a finer level of control and access to low level functionality provided by the selected development language.
- A code-oriented approach can offer better performance because of the opportunity to leverage natively compiled code.
- Connectivity may not be available for some data sources requiring the development of a custom solution.
- The visual advanced analytic tools market is nascent and growing. It can still be challenging to select a solution that provides the performance, functionality and required level of enterprise support.
Both the code-oriented approach and visual analytics platforms have their place. A visual analytics platform can provide and control the workflow, with low-level code or scripting supplying missing functionality or additional options for data visualization.
So what does the current landscape look like? The Gartner 2017 Magic Quadrant for Data Science Platforms, covered on KDnuggets™, lists the current contenders in this space. Based on current trends, there will definitely be increased growth in data science platforms that employ visual metaphors. These platforms should be considered a key component for augmenting traditional code-oriented approaches.