

Why Data Quality Matters When You Start Using Analytics, AI

Author: Nancy Kastl Co-author: Steven Devoe Posted In: Data, Machine Learning, Testing

The analytics world has exploded over the past decade and is expected to keep growing at astonishingly high rates for the next. As the excitement has grown, organizations are eager to realize the value of the latest and greatest analytics tools, whether that means new AI models or incorporating new data assets into their business intelligence strategy. As a result, the steps that ensure high data quality are expedited, overlooked, or sometimes skipped altogether.

What Happens When You Don’t Prioritize Data Quality

The horror stories caused by poor data quality seem endless, but they typically take one of a few forms, each beginning with low-quality data for one reason or another.

The first form of the story starts well, until you reach the analytics portion of the project and it becomes obvious that your analytics tool is inaccurate. This usually results in a delayed project timeline, a lot of unnecessary stress, some finger pointing, or, even worse, an abandoned project. As bad as this sounds, it is arguably the best-case scenario, because the damage is limited to the internal organization.

The second form of the story starts about the same, but this time the data quality issues are a little less noticeable and the analytics tools make the rounds throughout the organization or, even worse, reach external hands. How long it takes for someone to notice that things seem “off” varies, but it inevitably happens at some point. Hopefully the impacts are small, but often the low-quality data has already been used to make decisions that cost the organization very large sums of money.

In either case, it doesn’t take long within an organization before you start hearing things like “We just can’t do it,” “Our data quality is poor, and I don’t trust it,” or “All our analytics projects have failed.”

The business case for data quality is compelling when you consider:

  • 12% of revenue is lost to poor data quality (Experian Data Quality)
  • Bad data on average can cost $15M per company in the United States (Gartner)
  • 50% of a company’s time is spent on data issues (Thomas C. Redman – Data Quality Solutions)

"While 76% of C-Suite executives have AI and machine learning initiatives included in their company’s roadmap, 75% aren’t confident in the quality of their data." ~Trifacta


Defining Data Quality

Data quality is often discussed throughout organizations in nebulous terms, which creates an adverse reaction whenever the topic is raised. For analytics use cases, however, data quality can be defined in very specific terms (see the sketch after this list):

  • Completeness – the data has all the relevant values you would expect it to have. Data isn’t missing or truncated.
  • Consistency – the data doesn’t conflict between systems or sources, and a source of truth exists within the organization.
  • Uniqueness – the data isn’t duplicated within the system.
  • Timeliness – the data is updated frequently enough to be used appropriately in the analytics tool.
  • Accuracy – the data is correct and free of errors. Data is correctly formatted, and calculations produce the correct values.
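
To make these dimensions concrete, here is a minimal sketch of how they might translate into automated checks with pandas. The orders.csv file and its column names (order_id, customer_id, amount, order_date) are hypothetical, and consistency is noted rather than checked because it requires a second system to compare against (see the reconciliation tests later in this article).

```python
import pandas as pd

# Hypothetical extract; the file and column names are illustrative only.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

checks = {
    # Completeness: the fields the reports depend on are not missing.
    "completeness": orders[["order_id", "customer_id", "amount"]].notna().all().all(),
    # Uniqueness: the business key is not duplicated within the system.
    "uniqueness": not orders["order_id"].duplicated().any(),
    # Timeliness: the newest record falls inside the agreed refresh window.
    "timeliness": (pd.Timestamp.now() - orders["order_date"].max()) <= pd.Timedelta(days=1),
    # Accuracy: amounts are correctly formatted, non-negative numbers.
    "accuracy": pd.to_numeric(orders["amount"], errors="coerce").ge(0).all(),
    # Consistency would compare these values against a second source system,
    # which is what the reconciliation tests described later do.
}

for dimension, passed in checks.items():
    print(f"{dimension}: {'PASS' if passed else 'FAIL'}")
```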

What Causes Data Errors?

Obtaining high-quality data is a challenge for companies for several reasons. First, business and IT teams face increased demand to provide data quickly for informed business decisions. Second, the volume and complexity of data are growing at an exponential rate to meet the requirements of big data use cases. Finally, constant technological change, even a minor update, can disrupt data analysis and reporting. Changes that can lead to such disruptions include:

  • vendor updates to enterprise applications such as Salesforce, SAP, Workday, ServiceNow, Microsoft Dynamics, Oracle, etc.
  • vendor updates to reporting and BI platforms such as Tableau, Microsoft PowerBI, Cognos, Qlik, SAP Business Objects, etc.
  • new releases of custom applications
  • changes in data models or business rules
  • new data sources or enriched data by third-party Data-as-a-Service vendors
  • modifications to data lake/warehouse and integration infrastructure.

It is essential to thoroughly test new or changed functionality/features to ensure the correct functioning of applications. Similarly, it is imperative to conduct data tests to ensure the quality of structured, unstructured, or semi-structured data.

Enabling Analytics Through Testing

Data must arrive intact and fit for purpose as it travels from your business source systems through your enterprise data warehouse/data lake and data marts/methods to your data analytics reports, business dashboards, and AI/ML initiatives. Data errors must be caught early, before they become costly issues for your business.

Some questions addressed by data testing are:

  • Did the problem originate in the source data?
  • Were there issues with the transformation logic?
  • Is the report or process pulling from the right data mart?

Because different types of errors are introduced as your data makes its way through the data journey, various types of tests are needed.

Pre-Screening Checks: During ingestion of data from business source files, test metadata and format, checking for empty values, field types, minimum/maximum values, and so on, to give the data a clean start on its journey.
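
As an illustration, here is a minimal pre-screening sketch in pandas, assuming a pipe-delimited customer feed; the file name, expected columns, and age range are invented for the example.

```python
import pandas as pd

# Hypothetical source feed; delimiter, columns, and value ranges are assumptions.
feed = pd.read_csv("customer_feed.psv", sep="|", dtype=str)

expected_columns = ["customer_id", "signup_date", "age", "state"]

# Metadata check: the feed contains exactly the columns the pipeline expects.
assert list(feed.columns) == expected_columns, "unexpected column layout"

# No empty values in the mandatory identifier field.
cust = feed["customer_id"]
assert cust.notna().all() and cust.str.strip().ne("").all(), "missing or empty customer_id"

# Field type check: age must parse as a number.
ages = pd.to_numeric(feed["age"], errors="coerce")
assert ages.notna().all(), "non-numeric age values"

# Minimum/maximum value check: ages outside a plausible range signal a bad extract.
assert ages.between(0, 120).all(), "age values out of range"
```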

Field-Level Tests: As data flows from the data warehouse to data marts, test the business logic with aggregation, value-range, and transformation tests.
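
A sketch of what such field-level tests can look like; the sales mart, daily totals file, and discount rule are hypothetical and stand in for whatever aggregation and transformation logic your warehouse applies.

```python
import pandas as pd

# Hypothetical data mart extracts; file and column names are illustrative.
sales = pd.read_csv("sales_mart.csv")
totals = pd.read_csv("daily_totals.csv", index_col="sale_date")["daily_total"]

# Aggregation test: line-item amounts should roll up to the stored daily totals.
rollup, stored = sales.groupby("sale_date")["line_amount"].sum().round(2).align(totals.round(2))
assert rollup.equals(stored), "aggregation mismatch against daily totals"

# Value-range test: discount percentages must stay between 0 and 100.
assert sales["discount_pct"].between(0, 100).all(), "discount_pct out of range"

# Transformation test: net amount should equal gross amount less the discount.
expected_net = (sales["line_amount"] * (1 - sales["discount_pct"] / 100)).round(2)
assert (sales["net_amount"].round(2) == expected_net).all(), "net_amount transformation error"
```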

Vital Checks: Based on system constraints, test for uniqueness, completeness, null values, and referential integrity to catch additional errors.
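
For example, a sketch of these vital checks against two hypothetical warehouse tables; the Parquet paths and key names are assumptions.

```python
import pandas as pd

# Hypothetical warehouse extracts; paths, tables, and key names are assumptions.
orders = pd.read_parquet("warehouse/orders.parquet")
customers = pd.read_parquet("warehouse/customers.parquet")

# Uniqueness: the primary key must not repeat.
assert orders["order_id"].is_unique, "duplicate order_id values"

# Completeness / null checks: required columns carry no nulls.
required = ["order_id", "customer_id", "order_date"]
assert orders[required].notna().all().all(), "nulls in required columns"

# Referential integrity: every order points at a customer that actually exists.
orphans = ~orders["customer_id"].isin(customers["customer_id"])
assert not orphans.any(), f"{orphans.sum()} orders reference missing customers"
```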

Reconciliation Tests: Detailed source-to-target comparison tests validate the ETL process and ensure that each row of data arrives at its destination.
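
A sketch of one way to run a source-to-target reconciliation, assuming both sides have been extracted to files and share an order_id key; connection details are omitted and the names are hypothetical.

```python
import pandas as pd

# Hypothetical extracts from the two ends of the ETL; keys and files are assumptions.
source = pd.read_csv("source_extract.csv").set_index("order_id").sort_index()
target = pd.read_csv("target_extract.csv").set_index("order_id").sort_index()

# Row counts should match before values are compared.
assert len(source) == len(target), f"row count mismatch: {len(source)} vs {len(target)}"

# Keys present on one side but not the other point to dropped or spurious rows.
missing = source.index.difference(target.index)
extra = target.index.difference(source.index)
assert missing.empty and extra.empty, f"{len(missing)} rows missing, {len(extra)} unexpected rows"

# Column-by-column, row-by-row comparison of the shared keys.
diffs = source.compare(target[source.columns])
assert diffs.empty, f"{len(diffs)} rows differ between source and target"
```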

Report and App Testing: UI and API tests of the analytics tools, business applications, and integration points that handle the final display of data.
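
By way of illustration, here is an API-level check against a hypothetical reporting endpoint; the URL, parameters, and response fields are invented, and a UI equivalent would assert the same figure through a browser-automation tool.

```python
import requests

# Hypothetical reporting API; the URL, parameters, and response shape are assumptions.
BASE_URL = "https://reports.example.com/api/v1"

def test_revenue_report_matches_warehouse(expected_total: float) -> None:
    """Check that the figure shown to users agrees with the reconciled warehouse value."""
    response = requests.get(f"{BASE_URL}/revenue", params={"period": "2024-Q1"}, timeout=30)
    response.raise_for_status()

    payload = response.json()
    # The displayed total should match the warehouse-derived number to the cent.
    assert abs(payload["total_revenue"] - expected_total) < 0.01, (
        f"report shows {payload['total_revenue']}, warehouse says {expected_total}"
    )
```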

Traditional data testing methods are not enough to catch data errors. Manual “stare and compare” data checks are painfully slow, and data sampling leaves dangerous gaps, especially when migrating data to new infrastructure or consolidating systems. It is difficult, costly, and time-consuming to detect errors in millions of rows of data from multiple systems. Automated spot checks are insufficient to prevent data errors, as they do not provide comprehensive coverage.

To address this issue, we employ an automated end-to-end data testing approach that is customized for each step of the data journey. These tests are conducted continuously within your CI/CD pipeline to ensure consistent data validation. Building automated data tests requires a collaborative effort among a data engineer who is proficient in data systems, a data scientist who understands data quality and context, particularly in machine learning capabilities, and a test engineer with expertise in automation and test design.
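
As one way to wire such checks into a CI/CD pipeline, the sketch below wraps a few of them as pytest tests that can run on every build or data load; the warehouse path, column names, and freshness window are assumptions.

```python
import pandas as pd
import pytest

# Illustrative pytest module a CI/CD stage could run after every deployment or load;
# the warehouse path, column names, and thresholds are assumptions.

@pytest.fixture(scope="module")
def orders() -> pd.DataFrame:
    return pd.read_parquet("warehouse/orders.parquet")

def test_no_duplicate_keys(orders: pd.DataFrame) -> None:
    assert orders["order_id"].is_unique

def test_required_fields_populated(orders: pd.DataFrame) -> None:
    assert orders[["order_id", "customer_id", "amount"]].notna().all().all()

def test_data_is_fresh(orders: pd.DataFrame) -> None:
    # Fails the pipeline if the latest load is older than the agreed refresh window.
    age = pd.Timestamp.now() - pd.to_datetime(orders["order_date"]).max()
    assert age <= pd.Timedelta(hours=24)
```

Running pytest as a pipeline stage turns these checks into a gate: a failing data test stops the release the same way a failing unit test does.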

DataOps emphasizes the data pipeline, with clear communication among all stakeholders. Gartner defines DataOps as “a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and data consumers across an organization.” Additionally, “automated testing (removing bottlenecks arising from traditional, human-intensive testing approaches) enables higher quality amid more frequent delivery of new capabilities.”

Fortunately, there are now automated testing platforms available that enable end-to-end data testing, UI functional testing, and API integration testing across an enterprise’s commercial, homegrown, and legacy applications. With the right people, processes, and tools in place, achieving data quality goals is possible.