Battling Data Leakage in ML Models: Lessons from the School of the Art Institute

The School of the Art Institute of Chicago (SAIC) developed machine learning models to predict whether students accepted into the school would enroll. But when the models went into production, something had quietly degraded their predictions.

It was data leakage, a common and silent enemy in machine learning (ML) that undermines model performance. Having seen this problem before, SPR offered to help SAIC build a leak-free model. The project came about through SPR donating $50,000 of machine learning services in honor of the company’s 50th anniversary. We worked with SAIC, building on the school’s prior success in data science, to create a new ML model that would help the school more accurately predict enrollment.

In this article, we explore data leakage, its impacts, and most importantly, the ways to guard your machine learning model against it – just like we did for SAIC.

Understanding Data Leakage

Put simply, data leakage refers to the scenario where information from outside your training dataset is used to create the model. As this blog will explore, subtle decisions in the data science process can result in data leakage. The problem is that leakage leads to an overly optimistic evaluation of your model during the testing phase, as if the model were “cheating” by having access to information it shouldn’t have. The result is a model that performs exceptionally well during training and testing but falls short when asked to make accurate predictions in real-world production scenarios.

From our experience, data leakage occurs in two main ways. The first is preprocessing leakage, where information from the test set influences how the training data is prepared. For example, if you scale your features based on the full dataset before splitting it into training and test sets, you are effectively giving your model access to information it should never see. The second is leakage during predictive model building, where the model is trained on data that will not be available at prediction time.
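
To make the first failure mode concrete, here is a minimal sketch in Python using scikit-learn. The synthetic data and five-feature layout are illustrative assumptions, not SAIC’s actual dataset or pipeline; the point is the order of operations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: 1,000 applicants, 5 numeric features,
# and a binary target (e.g., enrolled or not).
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

# LEAKY: the scaler is fit on the full dataset, so test-set
# statistics (mean, variance) bleed into the training data.
X_scaled = StandardScaler().fit_transform(X)
X_train_bad, X_test_bad, _, _ = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# LEAK-FREE: split first, fit the scaler on the training set only,
# then apply that same fitted transform to the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```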

3 Reasons to Address Data Leakage Now

The harmful impacts of data leakage are multi-fold. It can cause:

  • Overfitting: Data leakage can lead to overfitting, where the model learns the noise and anomalies in the training data. The model then performs extremely well on the test data but poorly on truly unseen data, reducing its generalizability.
  • False Confidence: Since the model performs exceptionally well on the training and testing data due to leakage, it can lead to false confidence in its ability to make accurate predictions.
  • Poor Performance in Production: Despite the impressive training and testing scores, the model will underperform in production because the conditions in the real world are different from those during its development.

4 Ways to Save Your ML Model

Preventing data leakage is crucial for building a reliable and robust ML model. Consider the following activities:

  1. Careful Data Preparation: Always split your dataset into training and test sets before any preprocessing or exploratory data analysis. This ensures that information from the test set does not influence the training set.
  2. Mindful Feature Selection: Be cautious about the features you include in your model. If a feature contains information that won’t be available at prediction time, it is likely to cause leakage.
  3. Cross-Validation: Use cross-validation to evaluate your model’s performance. It reduces the chance of overfitting and gives a more honest appraisal of your model’s capabilities. Keeping preprocessing inside the cross-validation loop prevents leakage across folds, as the first sketch after this list shows.
  4. Time-Series Data Handling: If you’re dealing with time-series data, always ensure that your model is trained on past data and validated on future data, as the second sketch after this list shows. This simulates the real-world application of the model.
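
Practices 1 and 3 combine naturally: split before any preprocessing, then wrap the preprocessing and the model in a single pipeline so every cross-validation fold re-fits the scaler on its own training rows. The sketch below assumes scikit-learn, with a synthetic dataset and a logistic regression chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic dataset, not SAIC's admissions data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Practice 1: split before any preprocessing or exploration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Practice 3: a pipeline re-fits the scaler inside every fold,
# so no fold's validation rows leak into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final honest check on the untouched test set.
model.fit(X_train, y_train)
print(f"Held-out test accuracy: {model.score(X_test, y_test):.3f}")
```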
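
For practice 4, scikit-learn’s TimeSeriesSplit (again an illustrative tooling choice) enforces the past-trains, future-validates discipline by construction: every fold trains on earlier rows and validates only on strictly later ones.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Rows must be in chronological order for this to be meaningful,
# e.g., one row per application, ordered by submission date.
X = np.arange(100).reshape(-1, 1)

# Each fold trains on earlier rows and validates on later rows,
# so the model never "sees the future" during evaluation.
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"validate rows {test_idx[0]}-{test_idx[-1]}")
```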

The best way to follow the four practices above is to treat data science as a team sport. When responsibilities are delegated within the team and reviewed by the other members, each person can critically examine their segment of the machine learning process and surface issues early. In our project:

  • The School of the Art Institute was responsible for gathering the data and ensuring all data would be available at the time of prediction.
  • SPR prepared the data and built the machine learning model.

Having each team member responsible for a segment of the machine learning process reduced the chance of errors at each stage.

ML Models to Stand the Test of Time

Data leakage is a common pitfall that can result in your machine learning model underperforming in production. Being aware of it, understanding its implications, and knowing how to mitigate its impacts can improve the reliability of your model. Remember, building a model is not just about attaining high accuracy during training and testing phases but ensuring its robust performance in real-world scenarios. It's about creating models that stand the test of time and continue to deliver reliable predictions in production.

"It’s important for developers to test and learn – and one of those ways is through knowing what NOT to do,” said Kyle O’Connell, Director of Enrollment Analytics and Forecasting at SAIC. “Reworking our ML model to omit data related to students that enrolled after they accepted was a crucial piece in improving our ML model and in turn, helps us better understand the students who are most likely to enroll.”

In the end, understanding data leakage and knowing how to prevent it is not just a step to build better machine learning models. It’s a crucial step toward responsible and effective AI. By ensuring our models are free from data leakage, we’re not only improving their performance but also creating a more reliable and trustworthy AI environment.