Antoni Gaudi was a Spanish architect of the late 1800s who said, "There are no straight lines or sharp corners in nature. Therefore, buildings must have no straight lines or sharp corners."
If we agree with the premise that there are no straight lines in nature, it should also follow that we should be careful about using linear machine learning models in practice. There are certainly good use cases for linear models, but we have to be sure that the data in the range where we are making predictions does in fact have a linear relationship.
However, the real world is messy and inherently non-linear. How can we handle this in machine learning?
In this post I will use a dataset of concentric circles, which is clearly non-linear, to show how a linear scikit-learn LogisticRegression model fails, and then show how to use a scikit-learn RandomForest and a TensorFlow/Keras neural network to capture and predict on the non-linear data.
The goal of this article is to see how to apply machine learning models to non-linear data. A deep dive into any one particular model is out of scope.
We will use this dataset to train each model and make predictions, to see how well each model captures the non-linearity of the data.
This article assumes that all dependencies have been installed on your local machine, or that you are using a hosted environment such as AWS SageMaker.
Import the necessary libraries. We are using sklearn (scikit-learn), TensorFlow, and Keras. Note that this example assumes TensorFlow 2.0, which includes Keras.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Create the circles dataset
Sklearn has a very convenient helper function for creating the circles dataset called make_circles.
X, y = make_circles(n_samples=2000, factor=0.6, noise=0.1)
factor - scale factor between the inner and outer circle
noise - standard deviation of Gaussian noise added to the data
Look at the shape of the dataset to make sure it is what we expect.
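A quick shape check (recreating the make_circles call from above) might look like this:

```python
from sklearn.datasets import make_circles

# Recreate the dataset as above: 2000 points in two concentric circles
X, y = make_circles(n_samples=2000, factor=0.6, noise=0.1)

# X holds the (x, y) coordinates, y holds the class label (0 = outer, 1 = inner)
print(X.shape)  # (2000, 2)
print(y.shape)  # (2000,)
```

Each sample is a 2D point, and each label is a single class value, so the shapes are what we expect.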
Create Training, Testing, Holdout datasets
At a minimum, we must divide our dataset into a training and a testing dataset. You never want to test your model on data that was used for training. Often you also want to create a holdout set: the training and testing sets can be used during model training to verify accuracy, but the holdout set is only ever used to test the generalization accuracy of the model.
Using training, testing, and holdout sets becomes more important when you perform deep learning, as we will see later.
Create a holdout dataset from 10% of the data, then split the remaining 90% into a training dataset (75%) and a testing dataset (25%).
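One way to sketch that two-stage split with train_test_split (the variable names and random seeds here are my own, not the article's):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=2000, factor=0.6, noise=0.1)

# First carve off 10% of the data as the holdout set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=42)

# Then split the remaining 90% into 75% training / 25% testing
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_test), len(X_holdout))  # 1350 450 200
```

With 2000 samples this gives 200 holdout points, 450 testing points, and 1350 training points.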
View the accuracy on the testing and holdout datasets by calling score on the model.
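A minimal sketch of fitting and scoring the linear baseline (assuming the splits from above; exact scores will vary with the random seed and noise):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=2000, factor=0.6, noise=0.1, random_state=0)
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# A linear decision boundary cannot separate concentric circles,
# so accuracy should hover around chance (~0.5)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print(logreg.score(X_test, y_test))
print(logreg.score(X_holdout, y_holdout))
```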
logreg.score(X_test, y_test) = 0.48
logreg.score(X_holdout, y_holdout) = 0.45
As expected, using a linear model to predict on non-linear data did not do well; in fact, it's worse than just guessing.
View holdout dataset performance
Let's look at how the holdout set performed by plotting the actual target values overlaid with the predicted values. Any red circle with a blue center, or blue circle with a red center, is where the prediction was incorrect.
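One way to sketch that overlay plot (reusing the model and splits from earlier; the colormap and marker sizes are my own choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=2000, factor=0.6, noise=0.1, random_state=0)
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
y_pred = logreg.predict(X_holdout)

# Large circles show the actual class, small dots show the prediction;
# a mismatched center marks an incorrect prediction
plt.scatter(X_holdout[:, 0], X_holdout[:, 1], c=y_holdout, cmap="bwr", s=80)
plt.scatter(X_holdout[:, 0], X_holdout[:, 1], c=y_pred, cmap="bwr", s=15)
plt.savefig("holdout_overlay.png")

incorrect = y_pred != y_holdout
print(incorrect.sum(), "incorrect predictions out of", len(y_holdout))
```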
This is where the training, testing, and validation datasets really come into play. Notice we fit the model on the training data and provide the test data as the validation set. This way TensorFlow/Keras can use the test data to calculate the validation accuracy and loss during training.
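A sketch of a small network along those lines. The layer sizes, learning rate, and epoch count here are illustrative assumptions, not the article's exact configuration:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

X, y = make_circles(n_samples=2000, factor=0.6, noise=0.1, random_state=0)
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# A small fully connected network: two hidden layers are enough
# to bend the decision boundary around the inner circle
model = Sequential([
    Dense(16, activation="relu", input_shape=(2,)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer=SGD(learning_rate=0.1),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Fit on the training data; the test set doubles as the validation set
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=100, verbose=0)

# The holdout set is only touched once, at the very end
loss, acc = model.evaluate(X_holdout, y_holdout, verbose=0)
```

Passing validation_data to fit is what lets Keras report validation loss and accuracy alongside the training metrics in the returned history.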
When we evaluate the model after 50 epochs, we see the following:
The training loss/accuracy plot looks very good. It shows that after about 200 epochs the model did not improve, nor did it have much room to improve. We could have stopped training around 250 epochs in this case.
The model has very good accuracy performance of 99.5%. Notice that the neural network is not very complicated, nor is it very deep, and yet it performs very well.
You can see from the holdout set that there is one lone incorrectly predicted data point.
In this article we looked at how to address non-linear data in machine learning. The world around us is inherently non-linear. Even though we might be able to model linearity, that linearity might only hold for a certain range of values, and as you get to the extremes of those values, non-linearities start to present themselves.
Understanding how to apply non-linear models to data is important. Some machine learning algorithms, such as decision trees and deep learning neural networks, are inherently non-linear.
I have been asked, "since Deep Learning can be applied to both Linear and Non-Linear data, and Deep Learning can solve problems that Machine Learning can solve, should I always use Deep Learning?"
My feeling is no. You should start with classical machine learning techniques because they are simpler and, depending on your dataset, they could work very well. There is a principle called Occam's razor, which essentially states that the simplest explanation is usually the best one.
Starting with classical machine learning techniques is almost always the simpler approach, if it works. You can always scale up to a neural network if needed, but you will need more data and more compute (CPU, GPU, etc.) to get really good results.