Implementing a Data Quality Strategy

Version: 2.0.0


Man frustrated with dirty data

According to Experian’s 2021 Global Data Management Research, “Fifty-five percent of business leaders say they lack trust in their data assets, hindering their ability to be fully data-driven.”

What are Data Quality Issues?

Data quality issues range from basic problems like duplicate data, format issues, and invalid data to more complex challenges such as data bias, ambiguous data, and unstructured data. A more comprehensive list appears in the Table of Data Issues at the end of this article.
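To make the basic issues concrete, here is a minimal sketch in Python, using only the standard library, that flags duplicate records, incomplete records, and invalidly formatted email addresses. The sample records, the field names, and the deliberately simple email pattern are all assumptions for illustration, not a complete set of validation rules.

```python
import re
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com"},
    {"id": 2, "name": "Bob Ray", "email": "bob.example.com"},  # invalid: no "@"
    {"id": 3, "name": "Ann Lee", "email": "ann@example.com"},  # duplicate of id 1
    {"id": 4, "name": "Cy Dunn", "email": ""},                 # incomplete
]

# Deliberately simple pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_issues(rows):
    """Return a dict mapping issue name to the list of affected record ids."""
    issues = {"duplicate": [], "incomplete": [], "invalid_email": []}
    # Compare records on (name, email), ignoring case and surrounding spaces.
    keys = [(r["name"].strip().lower(), r["email"].strip().lower()) for r in rows]
    counts = Counter(keys)
    seen = set()
    for row, key in zip(rows, keys):
        email = row["email"].strip()
        if not email:
            issues["incomplete"].append(row["id"])
        elif not EMAIL_PATTERN.match(email):
            issues["invalid_email"].append(row["id"])
        # Flag the second and later occurrences of the same key as duplicates.
        if counts[key] > 1 and key in seen:
            issues["duplicate"].append(row["id"])
        seen.add(key)
    return issues


issues = find_issues(records)
print(issues)
```

In practice the matching key for duplicates and the validation rules would come from the business definition of each field rather than from a hard-coded pattern like this one.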

Steps to Better Quality Data

Sometimes it’s obvious where the issue is coming from: it’s always been around, and someone knows how to fix it temporarily to get a report out in time. Sound familiar? But to start fixing things properly, you have to ask a number of questions. I call it discovery! Discovery is about asking questions, and it is the foundation of your strategy to clean dirty data.

Discovery

Part of ‘Discovery’ is understanding the data quality issues and their impact on the business. Is it easier to wrangle the data after it has been entered than to change the systems and processes that would lead to better data? Changing well-defined processes in a way that inconveniences customers, just to get better data input, may have an adverse effect, especially if the data can be fixed relatively easily afterwards. Decide at which stage you will tackle each data issue, and frame that decision in terms of the value or risk to the business: you must be able to measure the impact in some way.
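One way to make "measure the impact" tangible is to record, for each issue, how many records it affects and a rough cost per affected record, then rank the issues. The sketch below does that in Python. Every number, issue name, and the `cost_per_record` field is a hypothetical placeholder; real figures would have to come from the business.

```python
from dataclasses import dataclass


@dataclass
class IssueMeasurement:
    name: str
    affected_records: int
    total_records: int
    cost_per_record: float  # hypothetical estimate agreed with the business

    @property
    def affected_share(self) -> float:
        """Fraction of all records affected by this issue."""
        if self.total_records == 0:
            return 0.0
        return self.affected_records / self.total_records

    @property
    def estimated_cost(self) -> float:
        """Rough cost: affected records times cost per record."""
        return self.affected_records * self.cost_per_record


# Made-up measurements for illustration only.
measurements = [
    IssueMeasurement("missing postal code", 1200, 40000, 0.50),
    IssueMeasurement("duplicate customers", 300, 40000, 4.00),
]

# Rank issues by estimated cost, highest first.
ranked = sorted(measurements, key=lambda m: m.estimated_cost, reverse=True)
for m in ranked:
    print(f"{m.name}: {m.affected_share:.1%} of records, "
          f"estimated cost {m.estimated_cost:.2f}")
```

Note that the issue touching the most records is not necessarily the most costly one, which is why the ranking uses estimated cost rather than record counts.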

“No data is clean, but most is useful.” — Dean Abbott

Understanding the lineage of your data will greatly help: where the data comes from, how it is transformed into new data points, what calculations are done, and how the results are presented in reports or dashboards. Normally, many aspects of definition (see below) are already in place; more than likely, however, they will be incomplete. If you can’t show the lineage of your data, how can you trust it?

Data lineage is important in tracking data quality issues. Cloud platforms such as Snowflake and Azure provide tools that can help.
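As a rough illustration of what lineage looks like as data, the sketch below models it as a simple mapping from each dataset to the datasets it is derived from, and walks that mapping to list everything upstream of a report. The dataset names are hypothetical, and this is not how any particular cloud tool represents lineage.

```python
# Hypothetical lineage map: each dataset lists the datasets it is derived from.
lineage = {
    "sales_dashboard": ["monthly_sales"],
    "monthly_sales": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
    "orders_raw": [],
    "crm_export": [],
}


def upstream_sources(dataset, graph):
    """Return every dataset that feeds into `dataset`, directly or indirectly."""
    found = []
    stack = list(graph.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited; this also guards against cycles
        found.append(current)
        stack.extend(graph.get(current, []))
    return found


sources = upstream_sources("sales_dashboard", lineage)
print(sorted(sources))
```

When a number on the dashboard looks wrong, a list like this tells you which datasets to inspect, and in which direction to trace the problem.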

Defining data is critical not only for cleaning bad data but also for maintaining quality data. It’s an important part of the strategy: well-maintained, quality data sources support data-driven decision-making and help achieve positive business outcomes.

Defining Your Data

This includes:

  1. Data Catalogue
  2. Data Structure
  3. Data Lineage
  4. Data Classification
  5. Data Ownership

Defining your data will go a long way towards ensuring you can tackle data quality issues quickly and efficiently. Most importantly, identify who owns the data. If you can’t identify an owner, can you assign one that you can work with? The problems in quality may be technical, human, or governance-related, or they may stem from a lack of operational systems, which allows issues to creep in.
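The five elements above can be captured in a single record per dataset. The following Python sketch shows one possible shape for such a catalogue entry and a check for datasets that have no owner. The class name, field names, and sample entries are assumptions for illustration, not a standard catalogue schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogueEntry:
    name: str                      # catalogue: what the dataset is called
    description: str
    columns: Dict[str, str]        # structure: column name -> data type
    upstream: List[str] = field(default_factory=list)  # lineage
    classification: str = "internal"                   # classification
    owner: Optional[str] = None                        # ownership


# Hypothetical entries for illustration only.
catalogue = [
    CatalogueEntry(
        name="customers_clean",
        description="Deduplicated customer records",
        columns={"customer_id": "int", "email": "str"},
        upstream=["crm_export"],
        classification="confidential",
        owner="CRM team",
    ),
    CatalogueEntry(
        name="orders_raw",
        description="Orders as received from the web shop",
        columns={"order_id": "int", "amount": "float"},
    ),
]

# Datasets without an owner are the first ones to follow up on.
unowned = [entry.name for entry in catalogue if not entry.owner]
print(unowned)
```

Even a small structure like this makes the ownership question answerable by a query instead of by asking around.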

“Without clean data, or clean enough data, your data science is worthless.” — Michael Stonebraker

To bring it all together, you need data management and operational and technological systems in place to reduce and eliminate data quality issues. The list below is a good start in what is needed to bring everything together to deliver on your data quality strategy. Data Teams, Data Owners, and Data Champions are vital as they are the glue to make things work. People are at the heart of solving problems, implementing solutions, and delivering outcomes that align with business strategy.

Data Management

Summary

Data quality issues can range from basic problems like duplication and formatting errors to complex ones such as data bias and unstructured data. Tackling these challenges requires a strategic approach that begins with data discovery: identifying and understanding where the data quality issues arise, whether at the source or during the transformation and analysis phases. The next step is defining your data through a data catalogue, structure, lineage, classification, and ownership to ensure clarity and accountability. Throughout, view each issue through the lens of its impact on the business. We are not looking for perfection; we are looking for quality data to help us gain meaningful insights to deliver business outcomes.

Having a clear strategy and systems in place will help address the problems that arise. Discovery, definition, and management will form the basis of a data strategy that resolves data quality issues and provides the foundation for building data products that deliver valuable insights.

References

Table of Data Issues

| Title | Description |
|---|---|
| Duplicate Data | Multiple records for the same entity, such as a customer appearing more than once in a database. |
| Redundant Data | Data that is repeated unnecessarily, such as multiple copies of the same file or record stored in different locations. |
| Orphaned Data | Data that has no link to other data, such as a transaction record without a corresponding customer record. |
| Incomplete Data | Missing critical information, such as an address without a postal code or a customer record without a phone number. |
| Truncated Data | Data that is cut off or incomplete due to system constraints or field limitations, such as a name being shortened. |
| Inaccurate Data | Incorrect information, such as misspelled names, wrong addresses, or incorrect numerical values. |
| Outdated Data | Information that is no longer current, such as old contact details or outdated product information. |
| Conflicting Data | Data where different sources or systems provide conflicting information, such as different addresses for the same entity. |
| Invalid Data | Data that doesn’t meet specific validation rules, such as an email address without a valid format (e.g., missing “@”). |
| Data Format Issues | Data that does not conform to expected formats, such as text in a numerical field. |
| Data Type Mismatch | Data entered in the wrong type or format, such as letters in a numeric field. |
| Data Integrity Violations | Data that violates predefined rules or constraints, such as duplicate primary keys or foreign key mismatches. |
| Inconsistent Data | Data that varies in format or structure, such as dates recorded in different formats (e.g., DD/MM/YYYY vs. MM/DD/YYYY). |
| Non-Standardized Data | Data that lacks consistent standards for naming, classification, or measurement, such as different units of measure. |
| Ambiguous Data | Data that lacks clarity or context, making it difficult to interpret, such as unclear abbreviations or codes. |
| Outliers or Anomalies | Data points that are significantly different from the norm, potentially due to input errors or exceptional cases. |
| Noise | Irrelevant or meaningless data, such as extraneous characters or unintended values. |
| Unstructured Data | Data that is not organized in a predefined manner, such as free-text fields or social media posts. |
| Hidden or Dark Data | Data that is collected but not used or analyzed, often because it is difficult to access or integrate. |
| Data Bias | Systematic errors or patterns in data that distort results, often caused by biased data collection or selection. |
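As an example of checking for one of these issues, the sketch below looks at the inconsistent date formats mentioned under Inconsistent Data. It tests each value against DD/MM/YYYY and MM/DD/YYYY using the standard library and reports which formats it could belong to. The sample values are made up, and a real column would likely need more candidate formats than these two.

```python
from datetime import datetime

# Candidate formats to test; labels follow the table, patterns are strptime codes.
FORMATS = {"DD/MM/YYYY": "%d/%m/%Y", "MM/DD/YYYY": "%m/%d/%Y"}


def matching_formats(value):
    """Return the labels of every candidate format that parses `value`."""
    matches = []
    for label, pattern in FORMATS.items():
        try:
            datetime.strptime(value, pattern)
            matches.append(label)
        except ValueError:
            pass  # value does not fit this format
    return matches


# Hypothetical column values for illustration only.
dates = ["25/12/2023", "12/25/2023", "03/04/2023", "2023-12-25"]
report = {value: matching_formats(value) for value in dates}

for value, formats in report.items():
    if not formats:
        status = "unrecognised format"
    elif len(formats) > 1:
        status = "ambiguous: " + " or ".join(formats)
    else:
        status = formats[0]
    print(f"{value}: {status}")
```

The ambiguous case is the dangerous one: a value like 03/04/2023 parses without error under both formats, so it can only be resolved by knowing how the source system records dates.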