"Getting the most out of your data" is a claim touted by many a software toolset these days. A commonly expressed problem is that most organizations cannot use these tools out of the box – they need to have clean, well-structured data first. This type of data is crucial for analysis as it allows you to clearly related each factor and response in your data set. Getting data into this format can be challenging – a task achieved through the application of data cleaning and data governance.
Data cleaning can take many forms – from copying and pasting data tables to validating each data point in a report. More robust tools exist to organize your data (think pivot tables, Power Query, or VLOOKUP in Excel), but they require work upfront before you can realize any improvements to the quality of your data. Compare this to data governance – a process where you define the ideal data structure that addresses the needs of your data analyses, along with the criteria for quality data. In this post, I compare and contrast data cleaning and data governance and explore where one versus the other is appropriate. In other words, what are some means to control for quality data, and to clean it up once it has been gathered?
What are data governance and data cleaning?
Data governance and data cleaning both fall under the category of “getting data ready for analysis”, but the main difference is how and when these methodologies are applied.
Data governance focuses on data quality by means of engineering and procedural controls before and during data collection (more on that below) – it is a proactive approach to data management. This can include rules, methods, philosophies, and tools (e.g., data capture templates) for getting well-structured, quality data into and out of your system. These can extend to linking your data into a single table and planning how best to use your data. In each of these cases, data governance is a tool you use to gather the right data the first time.
Contrast data governance with data cleaning – an exercise where data is transformed to make it amenable to analysis. Data cleaning is a reactive process where you create or enforce rules, develop data cleanup methods, apply philosophies, and build tools for data after it is collected – all in the name of taking data from dirty to clean. Data cleaning is a response to shortcomings in the data you have already collected. This reactive approach sounds bad on paper, and often is bad, because you are limited in what you can correct once the data is captured. But for better or worse, it is often the approach chosen because it requires less upfront consideration, or because it is an unavoidable step due to data compatibility issues.
A simple example…
Data governance and data cleaning approaches do not exist in a vacuum – either approach must be developed to support the real-life process driving the data generation, and the way the data will ultimately be used – i.e., they must consider the entire data lifecycle. For example, consider a team that tracks the inventory of chemicals in their lab via freeform text entry. Their data is captured in the table below, which has the same chemical listed five times.
| Bottle | Chemical Name | Concentration | Amount | Lot Number |
| --- | --- | --- | --- | --- |
Table 1. A list of chemical names, concentrations, amounts, and lot numbers for five entries in a chemical inventory list. Entry errors exist in each column, such as units embedded in data fields and duplicated entries for chemicals of the same lot.
The obvious issue with this table is that the “Chemical Name”, “Concentration”, and “Amount” columns do not have consistent data. We have two approaches to deal with this data set:
The data governance approach (performed prior to data collection):
Standardize the “chemical name” column by use of an ontological system of “Chemical Name” > “State” with a set of predefined terms. We would end up with one option for users: “Sodium Hydroxide > Aqueous”. This format provides an extensible solution in case we ever add other chemicals in this inventory such as “Ammonium Hydroxide > Aqueous”.
Standardize the “Concentration” and “Amount” columns so that units are never listed with the value and are instead part of data table headers. Require entries to be numeric and limited to being measured in certain units. We would end up with numeric data in the “Concentration” and “Amount” columns for every row.
Set up rules about what you are tracking. In Table 1, if the goal is to track unique chemical lot numbers, it may be better to split the table in two: one table with a unique entry for each lot (i.e., “Bottle” entries 0, 2, and 3 combined into one row, and entries 1 and 4 into another), and one table with an entry for each bottle that references the lot only (not the properties of the lot). On the other hand, the single-table approach can work fine, but it is necessary to put data validation controls in place to ensure consistent lot data entry (i.e., prevent conflicting data for the same lot). In practice, a good database setup can enforce all of this hygiene for you automatically.
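The governance steps above can be sketched as entry-time validation: records are checked against the predefined vocabulary and numeric-only rules before they ever reach the inventory table. This is a minimal illustration, not Riffyn's implementation; the allowed names, units, and field names are hypothetical.

```python
# A minimal sketch of entry-time validation (data governance): reject records
# before they reach the inventory table. Vocabularies and units are hypothetical.

ALLOWED_CHEMICALS = {"Sodium Hydroxide > Aqueous", "Ammonium Hydroxide > Aqueous"}

def validate_entry(record: dict) -> list:
    """Return a list of rule violations; an empty list means the record is clean."""
    errors = []
    if record.get("chemical_name") not in ALLOWED_CHEMICALS:
        errors.append("chemical_name must come from the predefined ontology")
    for field in ("concentration", "amount"):
        # Units live in the column header, so values must be bare numbers.
        if not isinstance(record.get(field), (int, float)):
            errors.append(f"{field} must be numeric, with units in the table header")
    return errors

# A well-formed record passes...
ok = validate_entry({"chemical_name": "Sodium Hydroxide > Aqueous",
                     "concentration": 10.0, "amount": 500})
# ...while free-text names and embedded units are rejected at the door.
bad = validate_entry({"chemical_name": "NaOH",
                      "concentration": "10 M", "amount": "500 mL"})
```

Because the check runs at capture time, the bad record never enters the table – the defining feature of the proactive approach.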
The data cleaning approach (performed after data collection):
Standardize the “Chemical Name” column by building a dictionary of synonyms for sodium hydroxide so that all values can be set to a common name such as “Sodium Hydroxide”. During cleanup we might use fuzzy matching to capture spelling errors, and then use a common chemical database (such as CAS) to anticipate each name that might be entered by a user.
Standardize the “Concentration” and “Amount” columns by extracting numeric values. Standardize or convert the units of the “Amount” column and determine whether conversions are needed across unit systems. We could use “regular expressions” (a text pattern matching tool) for this.
De-duplicate the data by summarizing by the lot number.
In these approaches, data governance is synonymous with the development of engineering and procedural controls that govern the way data is entered, whereas data cleaning is done via the application of software tools after the data is entered. While you might say that either approach can address all the problems with the data set, that is false. Developing a data governance policy requires standardizing what quality data looks like, something that is not feasible for every organization (such as having 5,000 employees agree on an ontological system). Data cleaning might address this by standardizing all names after data has been collected (each employee can enter whatever they want), but then the idea of what quality data looks like lies in the hands of a select few data scientists. This can be quite alarming if the team's data scientists lack domain knowledge of the problem to be solved. These tradeoffs mean each organization will likely deploy a combination of data cleaning and data governance that reflects the day-to-day practices of their scientists, the needs of their data analyses, and the underlying business goals.
How do I deploy data governance and data cleaning?
Deploying good data governance and data cleaning policies requires your organization to think carefully about your data requirements. A data governance policy should address:
The data you need to collect and purposes it will be used for.
The methods for data collection.
The criteria for data quality.
The structure of the collected data.
The methods for enforcement of governance rules.
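One way to make that checklist concrete is to encode the policy as a declarative schema that both people and validation code can read. The fields below are illustrative, not a Riffyn format.

```python
# A governance policy for the Table 1 inventory, written as a declarative
# schema. Every field name and value here is illustrative.
INVENTORY_POLICY = {
    "purpose": "track chemical stock and lot usage for lab audits",
    "collection_method": "structured entry form with dropdowns, no free text",
    "quality_criteria": {"concentration": "numeric, molar",
                         "amount": "numeric, millilitres"},
    "structure": ["bottle_id", "chemical_name", "concentration", "amount", "lot"],
    "enforcement": "entry rejected unless all quality criteria pass",
}
```

Writing the policy down in a machine-readable form means the enforcement rules can be checked automatically rather than by convention.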
Data cleaning is a bit more difficult. You cannot always anticipate the needs of your data cleaning algorithms until you see the data, but you can follow some general best practices:
Do not remove ‘outliers’ unless you have good reason to do so.
Remove or hide data that is irrelevant to your immediate use case.
Standardize data and naming conventions.
Consult domain expertise before applying a large change to the data.
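To make one of those practices concrete – standardizing naming conventions – here is a small sketch that normalizes free-text column headers so downstream code can rely on them. The rules (lowercase, snake_case) are illustrative choices, not a universal standard.

```python
# Standardize free-text column headers to a predictable convention.
# The lowercase/snake_case rules here are one illustrative choice.
import re

def standardize_column(name: str) -> str:
    """Lowercase, trim, and snake_case a free-text column header."""
    name = name.strip().lower()
    name = re.sub(r"[^\w]+", "_", name)  # runs of non-word characters -> underscore
    return name.strip("_")

print(standardize_column("  Chemical Name "))  # -> chemical_name
print(standardize_column("Lot Number"))        # -> lot_number
```

Small, deterministic transforms like this are safe to apply broadly; the riskier changes (merging rows, dropping values) are the ones that call for the domain review mentioned above.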
As your organization develops, you can try addressing common data cleaning operations via implementation of new data governance policies. This means that data cleaning will always be the first line of defense against poor-quality data, and a source of ideas for which new data governance policies to implement. This could range from simply cleaning data to extract common chemical names, to deploying a global data quality initiative using a tool such as Riffyn, which provides both data governance and cleaning capabilities designed to work in concert. Moreover, Riffyn is flexible and adaptive to a scientist's needs – it works to avoid excessive constraints on a scientist's work while maintaining data quality.
Using the Riffyn SDE can help you govern your data and define methods for creation of well-structured data tables. It provides cutting-edge approaches to data organization using patented technology and a design-first approach. If you’d like to learn more, send a note to email@example.com.
More programmatic tools might include pandas functions such as applymap() or fillna() in Python.
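For readers unfamiliar with those pandas tools, here is a quick illustration (applymap() was later renamed DataFrame.map() in newer pandas versions, so Series.map is used for the elementwise step; the data are made up):

```python
# A quick illustration of the pandas cleaning tools mentioned above.
import pandas as pd

df = pd.DataFrame({"concentration": [10.0, None, 5.0],
                   "amount": [500.0, 250.0, None]})

filled = df.fillna(0.0)  # replace missing values with an explicit default

# Elementwise transformation; Series.map works across pandas versions.
labels = filled["amount"].map(lambda x: f"{x:g} mL")
```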
Algorithmic approaches like k-means clustering and Gaussian mixture modeling can be used to “clean” data, but are really applied only once the data has been structured.
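As a sketch of the clustering idea: a toy one-dimensional k-means can separate amounts that were entered in inconsistent units (e.g., litres mixed in with millilitres) into groups for review. This is a deliberately simplified pure-Python version, and the sample values are invented; a real analysis would use a library implementation such as scikit-learn's.

```python
# Toy 1-D k-means: cluster numeric amounts to surface values that were
# probably entered in the wrong unit. Simplified sketch; sample data invented.

def kmeans_1d(values, k=2, iters=20):
    # Seed centers with spread-out values from the sorted list.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Amounts recorded mostly in mL, with two entries that look like litres.
values = [480, 500, 510, 0.5, 0.48]
centers, groups = kmeans_1d(values)
```

The small cluster (0.5 and 0.48) flags rows worth a second look – but note this only works because the values are already numeric, i.e., after structural cleaning is done.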
When we talk about “dirty” data, we mean data that is not ready for an analysis that a user is interested in. This can take many forms – data spread across spreadsheets, in the wrong structure, in merged cells, etc. “Clean” data refers to data that is in a format that readily enables a desired analysis.
Jack Morel is a Senior Scientist at Riffyn. Jack previously developed high-throughput screening capabilities for late-stage biologics development. In these roles he worked with cell culture, purification, and analytical departments to develop new process models and build new data reporting pipelines. Jack now works with both bench scientists and data architects to help them get the most out of their scientific data.