Data Hygiene
What to clean before it hits production, and the pitfalls that quietly cost you money

Most operational data quality problems are not caused by exotic edge cases. They are caused by a small set of recurring issues that creep in at the point data enters a system, accumulate quietly for months, and only surface when someone tries to bill, forecast, or report on the back of them.
Six Principles Worth Living By
Cleanse on the Way In, Not on the Way Out
It is tempting to load raw data and clean it later inside reports or dashboards. This always costs more in the long run. The same cleansing logic gets re-implemented in every report, slightly differently each time, until you can no longer tell which version of “total revenue” is the right one. Run cleansing as a clearly defined step before data lands in the production tables, document what was changed and why, and treat the production layer as a clean source of truth from which all reporting flows.
⚠️ Pitfall: If you find yourself writing IF/AND/OR statements in Power BI or Excel to fix data values, that logic belongs upstream in your ingestion pipeline. Every report that re-implements it is a bug waiting to happen.
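To make the shared-logic point concrete, here is a minimal sketch assuming a dict-per-row pipeline. The field name and the rule itself are illustrative; the point is that the fix lives in one documented ingestion-time function, not in every report.

```python
def cleanse_client_name(row: dict) -> dict:
    """One documented ingestion rule, replacing the IF/AND/OR
    logic that would otherwise be re-implemented per report."""
    cleaned = dict(row)  # never mutate the raw row in place
    cleaned["client"] = cleaned["client"].strip().upper()
    return cleaned
```

Every report then reads the already-clean value, so there is exactly one definition of what a client name looks like.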
Preserve the Raw, Never Overwrite It
Always keep an untouched copy of the source data in a staging area. Cleansing should produce a transformed copy, not a destructive update of the original. When something goes wrong six months later – and it will – you need to be able to reproduce exactly what came in, what was changed, and what landed in production.
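A sketch of what "preserve the raw" can look like in practice, assuming file-based delivery (the staging layout and date-stamping convention are illustrative): the source file is copied byte-for-byte into staging, and all cleansing reads the copy.

```python
import pathlib
import shutil
from datetime import date

def stage_raw_file(src: pathlib.Path, staging_dir: pathlib.Path) -> pathlib.Path:
    """Copy the raw extract untouched into staging, stamped with its
    arrival date. Cleansing reads the staged copy; the original file
    and the staged copy are never edited."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / f"{date.today().isoformat()}_{src.name}"
    shutil.copy2(src, staged)  # byte-for-byte copy, metadata preserved
    return staged
```

Six months later, re-running the pipeline against the staged copy reproduces exactly what came in.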
Validate Before You Transform
Before applying business rules, run cheap structural checks: is the row count plausible, are required columns present, are date ranges within reasonable bounds, do the totals roughly match the source system. A failing structural check should stop the pipeline and alert someone – not silently produce a half-loaded dataset that takes a week to spot.
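The checks above can be sketched as a single gate that raises rather than returning a partial result. Column names, the expected row-count range, and the reconciliation tolerance are all assumptions you would agree per feed.

```python
def structural_checks(rows, required_cols, expected_rows, source_total,
                      tolerance=0.01):
    """Cheap structural gate: fail loudly before any business rules run.
    Assumes rows are dicts with a numeric 'amount' field (illustrative)."""
    lo, hi = expected_rows
    if not lo <= len(rows) <= hi:
        raise ValueError(f"implausible row count: {len(rows)}, expected {lo}-{hi}")
    missing = set(required_cols) - set(rows[0])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    total = sum(r["amount"] for r in rows)
    if abs(total - source_total) > tolerance * source_total:
        raise ValueError(f"total {total} does not reconcile with source {source_total}")
```

Wiring the raised error to an alert is what turns a half-loaded dataset from a week-long mystery into a same-morning fix.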
Make Every Change Attributable
Every cleansed row should carry, at minimum, a flag that says it was modified, a reference to the rule that modified it, and a timestamp. This is what lets you answer the auditor’s question – “why does this row in production look different from the one in the source extract?” – without trawling through code.
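A minimal sketch of attributable changes, assuming dict-per-row data; the audit field names are illustrative, not a standard.

```python
from datetime import datetime, timezone

def apply_rule(row: dict, field: str, new_value, rule_id: str) -> dict:
    """Apply one cleansing rule and stamp the row with what changed,
    which rule changed it, and when."""
    audited = dict(row)  # leave the input row untouched
    audited[field] = new_value
    audited["_modified"] = True
    audited["_rule"] = rule_id
    audited["_modified_at"] = datetime.now(timezone.utc).isoformat()
    return audited
```

Answering the auditor's question then becomes a filter on `_rule`, not an archaeology exercise through code history.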
Quantify the Cost of Being Wrong
When you find a data quality issue, attach a number to it: how many rows, how much spend, what proportion of the dataset. “We have some duplicates” never gets prioritised. “We have 2,847 duplicate records representing £143,000 of double-counted revenue” gets fixed on Monday. The numbers turn data quality from a technical problem into a commercial one – which is the only framing that ever attracts budget.
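Producing that number is usually a few lines of work. A sketch, assuming dict-per-row data with an invoice key and an amount field (both names illustrative): count every row beyond the first occurrence of each key, and sum its value.

```python
def duplicate_impact(rows, key_field, value_field):
    """Return (duplicate_row_count, double_counted_value): the number of
    rows beyond the first occurrence of each key, and their total value."""
    seen = set()
    dup_rows, dup_value = 0, 0.0
    for r in rows:
        if r[key_field] in seen:
            dup_rows += 1
            dup_value += r[value_field]
        else:
            seen.add(r[key_field])
    return dup_rows, dup_value
```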
Treat System-Generated Data with Healthy Scepticism
There is a persistent assumption that data produced by a computer system must be correct, simply because a computer produced it. Systems generate data that is internally consistent – meaning it follows their rules and survives their validation – but that is a much weaker guarantee than being right.
Bad data finds its way into clean-looking systems through surprising routes: free-text fields repurposed for structured values, workaround entries to pass validation, overnight jobs that run twice and double a day’s data, device feeds that silently report zeros instead of nothing. None of these are detectable from inside the system that produced them.
⚠️ Pitfall: If a number from a system feels surprising – too high, too low, too round, too consistent – investigate before you report it. The instinct that something looks wrong is often the only signal you’ll get. Dismissing it because “the system generated it” is how bad data ends up in board packs.
The Pitfalls That Quietly Cost You Money
Each of the issues below is encountered in roughly nine out of ten datasets. They are simple to describe, and once you know what to look for, simple to fix. The reason they persist is that individually they are small – so nobody escalates them – while collectively they undermine reporting accuracy and slow every data project that follows.
Date & Time Inconsistency
Dates entered as strings in mixed formats – 01/02/2024 alongside 2024-02-01 alongside Feb 1 2024 – are the single most common cause of silent data corruption. The system will accept all three, but downstream queries will silently exclude or misinterpret the minority pattern. Time zones cause similar damage: a UTC timestamp processed as local time will be hours out without ever throwing an error.
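One way to force the issue is to parse against an explicit whitelist of formats and fail on anything else. A sketch: the format list is an assumption you agree per feed, and its order encodes how ambiguous dates like 01/02/2024 are read (here, day-first).

```python
from datetime import datetime

# Accepted formats, in priority order; day-first is assumed for
# slash dates, which is the decision that must be made explicitly.
DATE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%b %d %Y"]

def parse_date(value: str) -> datetime:
    """Parse a date string against the agreed formats, or fail loudly."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {value!r}")
```

The raise is the point: a date that matches no agreed format stops the load instead of being silently excluded downstream.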
Numbers Stored as Text
Sales prices arriving as “£1,234.56” or “(450.00)” for negatives, weights with units appended, postcodes with and without gaps – all of these break aggregation and silently exclude rows from numeric comparisons. The column looks fine in a spreadsheet but produces nonsense when summed.
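A sketch of a defensive parser for currency-style text, handling the thousands separator, the currency symbol, and the accounting convention of parentheses for negatives (units-appended weights would need similar per-field treatment).

```python
def parse_amount(raw: str) -> float:
    """Convert text like '£1,234.56' or '(450.00)' to a float,
    failing loudly on anything unrecognised."""
    s = raw.strip().replace(",", "").replace("£", "")
    # Accounting negatives: "(450.00)" means -450.00
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    return float(s)  # raises ValueError on residual stray text
```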
Encoding Artefacts & Mojibake
If you have ever seen Â£ instead of £, or Ã© instead of é, you have seen mojibake – the visible signature of data read with the wrong character encoding. It is harmless to look at but breaks every text comparison: “Café” and “CafÃ©” are not equal, so data joins on company names quietly drop rows.
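The classic pattern – UTF-8 bytes read as Latin-1 – is often reversible. A sketch of that one repair; messier double-encodings are better handled by a dedicated library such as ftfy.

```python
def fix_mojibake(text: str) -> str:
    """Reverse the common 'UTF-8 read as Latin-1' corruption.
    Text that doesn't fit the pattern is returned unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not the classic pattern; leave as-is
```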
Whitespace, Casing & Invisible Characters
“ACME LTD ” with a trailing space is not the same value as “ACME LTD” without one. Data joins fail, deduplication misses obvious duplicates, and aggregations split across what should be a single client. Worse, control characters and zero-width spaces from copy-pasted spreadsheet data are invisible even when you look directly at them.
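Because the offending characters are invisible, the fix has to work on Unicode categories rather than on what you can see. A sketch: format characters (which include zero-width spaces) are dropped, control characters become spaces, then whitespace and casing are normalised.

```python
import unicodedata

def normalise_identifier(value: str) -> str:
    """Strip invisible characters and normalise an identifier for
    joining and deduplication."""
    chars = []
    for ch in value:
        cat = unicodedata.category(ch)
        if cat == "Cf":          # zero-width and other format chars: drop
            continue
        chars.append(" " if cat == "Cc" else ch)  # control chars: space
    return " ".join("".join(chars).split()).upper()
```

Join and deduplicate on the normalised value, while displaying the original.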
Inconsistent Null Markers
Within a single column, you may find genuine nulls alongside empty strings, the literal text “NULL”, “N/A”, “-”, “unknown”, and the number 0 used as a placeholder. Each of these means roughly the same thing to a human and something completely different to a query.
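Collapsing the text markers to one canonical representation is mechanical; a sketch follows. Note that 0-as-placeholder is deliberately left alone, because deciding whether 0 means "missing" or "zero" requires domain knowledge, not string matching.

```python
# Text values treated as missing (illustrative list; agree it per feed).
NULL_MARKERS = {"", "null", "n/a", "na", "-", "unknown"}

def normalise_null(value):
    """Map all textual 'missing' markers to None; leave 0 untouched,
    since only domain knowledge can say whether 0 means missing."""
    if isinstance(value, str) and value.strip().lower() in NULL_MARKERS:
        return None
    return value
```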
Duplicate & Near-Duplicate Rows
Exact duplicates are easy to spot. Near-duplicates are not – the same booking imported twice with one column trimmed differently, the same client recorded as both “Acme Ltd” and “Acme Limited”, the same invoice loaded once from a CSV and once from a system export. Each pair inflates revenue, distorts forecasts, and corrupts customer counts.
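A first line of defence is to deduplicate on a normalised key rather than the raw value. A sketch: the suffix mapping is illustrative and would grow per domain, and genuinely fuzzy matching needs a dedicated library (e.g. rapidfuzz) rather than this.

```python
import re

# Hypothetical canonical forms for common company-name suffixes.
SUFFIX_MAP = {"ltd": "limited", "co": "company"}

def dedup_key(name: str) -> str:
    """Build a normalised key so 'Acme Ltd' and 'Acme Limited'
    deduplicate to the same record."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(SUFFIX_MAP.get(w, w) for w in words)
```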
Referential Integrity Gaps
Every foreign key value in your transactions table should point at a real record in the master table. In practice, you will find orphan client codes, cost centre references that don’t exist, and service codes that were retired six months ago and are still being used. Each orphan is a row that cannot be cleanly attributed in any report.
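Finding orphans before load is a set-membership check per foreign key. A sketch, assuming dict-per-row transactions and a set of valid master keys (field names illustrative):

```python
def find_orphans(transactions, master_keys, fk_field):
    """Return the transactions whose foreign key does not resolve
    to any record in the master table."""
    return [t for t in transactions if t[fk_field] not in master_keys]
```

Quarantine the returned rows for review rather than loading them, so no report has to guess where they belong.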
Outliers That Aren’t Actually Impossible
A negative consignment weight, a delivery date before the booking date, a sale price five orders of magnitude larger than the typical row – these are not errors that crash the pipeline. They flow through, distort the averages, and quietly bias every report that touches them. Often they originate from a unit-of-measure switch (grams entered as kilograms) or a decimal-place typo.
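These rows need explicit plausibility rules, since nothing else will reject them. A sketch: field names and the order-of-magnitude threshold are illustrative assumptions, to be agreed per dataset.

```python
def plausibility_errors(row, typical_price=100.0):
    """Flag values that are storable but implausible in this domain.
    Field names and typical_price are illustrative."""
    errors = []
    if row["weight_kg"] <= 0:
        errors.append("non-positive weight")
    if row["delivered"] < row["booked"]:
        errors.append("delivery before booking")
    if row["price"] > typical_price * 1000:
        errors.append("price orders of magnitude above typical")
    return errors
```

Rows with a non-empty error list go to quarantine with their reasons attached, not into the averages.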
A Checklist to Run Before Data Hits Production
Use this as a starting point. You will not need every check on every dataset, but if you cannot say with confidence that each box has been considered, you have an unknown risk in your pipeline.
- File arrived in the expected format and encoding (UTF-8 unless agreed otherwise)
- Row count is within the expected range for this load – no empty load, no runaway load
- All required columns are present and named as expected
- Schema matches the previous successful load – new or removed columns are flagged for review
- Dates and times parse cleanly to a single canonical format and timezone
- Numeric fields are numeric – no embedded currency symbols, commas, or stray text
- Text fields are trimmed and casing is normalised on identifiers
- Null markers are consistent – only one representation of “missing” exists in any column
- Encoding artefacts have been detected and either fixed or rejected
- Foreign keys resolve cleanly to a master record (clients, services, cost centres)
- Values fall within plausible ranges – no negative weights, no impossible dates
- Mandatory business fields are populated for the row’s status (e.g. completed jobs have a delivery date)
- Currency, units of measure, and time zones are explicit and consistent
- Row count matches the source system’s count for the same period
- Key totals (revenue, volume, weight) reconcile to the source within an agreed tolerance
- Distinct key values match between source and target
- Every modification a cleansing rule made is logged against the row it changed
- Raw source data has been preserved untouched in the staging area
- Failed rows are quarantined with a reason for human review – not silently dropped
- The pipeline has alerted a named owner if any check failed
Want to Talk This Through?
Most teams know these principles are right, but find that day-to-day delivery pressure makes implementing them systematically difficult. If you’d like to discuss how they apply to your pipeline – or where to start – get in touch or download your own checklist.

