📋 Briefing Note – DigiRelevance

Data Hygiene

What to clean before it hits production, and the pitfalls that quietly cost you money

Most operational data quality problems are not caused by exotic edge cases. They are caused by a small set of recurring issues that creep in at the point data enters a system, accumulate quietly for months, and only surface when someone tries to bill, forecast, or report on the back of them.

Part One

Six Principles Worth Living By

1. Cleanse on the Way In, Not on the Way Out

It is tempting to load raw data and clean it later inside reports or dashboards. This always costs more in the long run. The same cleansing logic gets re-implemented in every report, slightly differently each time, until you can no longer tell which version of “total revenue” is the right one. Run cleansing as a clearly defined step before data lands in the production tables, document what was changed and why, and treat the production layer as a clean source of truth from which all reporting flows.

⚠️ Pitfall: If you find yourself writing IF/AND/OR statements in Power BI or Excel to fix data values, that logic belongs upstream in your ingestion pipeline. Every report that re-implements it is a bug waiting to happen.
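
A minimal sketch of what “upstream” can look like, assuming a pandas-based ingestion step; the file and column names are illustrative:

```python
import pandas as pd

def cleanse(raw: pd.DataFrame) -> pd.DataFrame:
    """The single, documented cleansing step, run before anything lands in production."""
    df = raw.copy()                                    # never mutate the staged copy
    df["client_name"] = df["client_name"].str.strip().str.upper()
    df["invoice_date"] = pd.to_datetime(df["invoice_date"], errors="coerce")
    return df

raw = pd.read_csv("daily_extract.csv")                 # raw stays untouched in staging
production = cleanse(raw)                              # every report reads from this
```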

2. Preserve the Raw, Never Overwrite It

Always keep an untouched copy of the source data in a staging area. Cleansing should produce a transformed copy, not a destructive update of the original. When something goes wrong six months later – and it will – you need to be able to reproduce exactly what came in, what was changed, and what landed in production.
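
One way to enforce this, sketched in Python with illustrative paths: stage the incoming file write-once, and make the cleansing step read only from the staged copy.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def stage_raw(source: Path, staging_dir: Path) -> Path:
    """Copy the incoming file into staging, stamped with arrival time, and never touch it again."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = staging_dir / f"{stamp}_{source.name}"
    if target.exists():
        raise FileExistsError(f"Refusing to overwrite staged file: {target}")
    shutil.copy2(source, target)    # cleansing reads from this copy; nothing updates it
    return target
```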

3. Validate Before You Transform

Before applying business rules, run cheap structural checks: is the row count plausible, are required columns present, are date ranges within reasonable bounds, do the totals roughly match the source system. A failing structural check should stop the pipeline and alert someone – not silently produce a half-loaded dataset that takes a week to spot.
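
A sketch of what those structural checks might look like in a pandas pipeline; the column names and expected row range are assumptions:

```python
import pandas as pd

REQUIRED_COLUMNS = {"booking_id", "client_code", "booking_date", "net_amount"}  # illustrative

def structural_checks(df: pd.DataFrame, expected_rows: range) -> None:
    """Cheap checks that stop the pipeline and alert, rather than half-load."""
    problems = []
    if len(df) not in expected_rows:
        problems.append(f"row count {len(df)} outside expected range")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing required columns: {sorted(missing)}")
    elif pd.to_datetime(df["booking_date"], errors="coerce").isna().any():
        problems.append("booking_date contains unparseable values")
    if problems:
        raise ValueError("structural checks failed: " + "; ".join(problems))

# structural_checks(df, expected_rows=range(50_000, 80_000))
```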

4. Make Every Change Attributable

Every cleansed row should carry, at minimum, a flag that says it was modified, a reference to the rule that modified it, and a timestamp. This is what lets you answer the auditor’s question – “why does this row in production look different from the one in the source extract?” – without trawling through code.
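
One possible shape for this in pandas – a single helper through which every rule is applied, so the audit columns can never be forgotten; the column and rule names are illustrative:

```python
from datetime import datetime, timezone

import pandas as pd

def apply_rule(df: pd.DataFrame, mask: pd.Series, column: str, value, rule_id: str) -> pd.DataFrame:
    """Apply one cleansing rule and record what changed, which rule did it, and when."""
    df = df.copy()
    df.loc[mask, column] = value
    df.loc[mask, "was_modified"] = True
    df.loc[mask, "modified_by_rule"] = rule_id
    df.loc[mask, "modified_at_utc"] = datetime.now(timezone.utc).isoformat()
    return df

# e.g. df = apply_rule(df, df["weight_kg"] < 0, "weight_kg", pd.NA, "RULE-014-negative-weight")
```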

5. Quantify the Cost of Being Wrong

When you find a data quality issue, attach a number to it: how many rows, how much spend, what proportion of the dataset. “We have some duplicates” never gets prioritised. “We have 2,847 duplicate records representing £143,000 of double-counted revenue” gets fixed on Monday. The numbers turn data quality from a technical problem into a commercial one – which is the only framing that ever attracts budget.
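
A two-line illustration of attaching the number, assuming invoice data in pandas with hypothetical invoice_number and net_amount columns:

```python
import pandas as pd

df = pd.read_csv("invoices.csv")                       # file and column names are illustrative
dupes = df[df.duplicated(subset=["invoice_number"], keep="first")]
print(f"{len(dupes):,} duplicate records representing "
      f"£{dupes['net_amount'].sum():,.0f} of double-counted revenue")
```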

6. Treat System-Generated Data with Healthy Scepticism

There is a persistent assumption that data produced by a computer system must be correct, simply because a computer produced it. Systems generate data that is internally consistent – meaning it follows their rules and survives their validation – but that is a much weaker guarantee than being right.

Bad data finds its way into clean-looking systems through surprising routes: free-text fields repurposed for structured values, workaround entries to pass validation, overnight jobs that run twice and double a day’s data, device feeds that silently report zeros instead of nothing. None of these are detectable from inside the system that produced them.

⚠️ Pitfall: If a number from a system feels surprising – too high, too low, too round, too consistent – investigate before you report it. The instinct that something looks wrong is often the only signal you’ll get. Dismissing it because “the system generated it” is how bad data ends up in board packs.


Part Two

The Pitfalls That Quietly Cost You Money

Each of the issues below is encountered in roughly nine out of ten datasets. They are simple to describe, and once you know what to look for, simple to fix. The reason they persist is that individually they are small – so nobody escalates them – while collectively they undermine reporting accuracy and slow every data project that follows.

📅 Date & Time Inconsistency

Dates entered as strings in mixed formats – 01/02/2024 alongside 2024-02-01 alongside Feb 1 2024 – are the single most common cause of silent data corruption. The system will accept all three, but downstream queries will silently exclude or misinterpret the minority pattern. Time zones cause similar damage: a UTC timestamp processed as local time will be hours out without ever throwing an error.

Fix: Standardise on ISO 8601 (YYYY-MM-DD HH:MM:SS) at the ingestion boundary. Store all timestamps in UTC and convert to local only at the point of display. Reject rows where the date cannot be parsed cleanly, rather than guessing.
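
A sketch of that boundary in pandas, assuming the source has agreed to send YYYY-MM-DD HH:MM:SS strings; anything else is returned for quarantine rather than guessed at:

```python
import pandas as pd

def parse_timestamps(df: pd.DataFrame, column: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Parse to the agreed format in UTC; return (clean, rejected) rather than guessing."""
    parsed = pd.to_datetime(df[column], format="%Y-%m-%d %H:%M:%S", errors="coerce", utc=True)
    rejected = df[parsed.isna()]                       # unparseable (or missing) timestamps
    clean = df[parsed.notna()].assign(**{column: parsed[parsed.notna()]})
    return clean, rejected
```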

🔢 Numbers Stored as Text

Sales prices arriving as “£1,234.56” or “(450.00)” for negatives, weights with units appended, postcodes with and without gaps – all of these break aggregation and silently exclude rows from numeric comparisons. The column looks fine in a spreadsheet but produces nonsense when summed.

Fix: Strip currency symbols, commas, and parentheses (converting to negatives) at ingestion. Validate that what should be numeric actually is, and quarantine anything that can’t be parsed instead of silently zeroing it.
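
A possible pandas implementation – the regex handles £/$/€ symbols, thousands commas, and accounting-style parentheses; anything still unparseable is flagged for quarantine, not zeroed:

```python
import pandas as pd

def to_numeric_strict(s: pd.Series) -> tuple[pd.Series, pd.Series]:
    """Strip currency formatting, convert (450.00)-style negatives, flag the rest."""
    cleaned = (s.astype("string").str.strip()
                 .str.replace(r"[£$€,]", "", regex=True)
                 .str.replace(r"^\((.*)\)$", r"-\1", regex=True))   # (450.00) -> -450.00
    numeric = pd.to_numeric(cleaned, errors="coerce")
    quarantine_mask = numeric.isna() & s.notna()       # parse failures, not genuine nulls
    return numeric, quarantine_mask

# values, bad = to_numeric_strict(df["sale_price"])
```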

🔤 Encoding Artefacts & Mojibake

If you have ever seen Â£ instead of £, or Ã© instead of é, you have seen mojibake – the visible signature of data read with the wrong character encoding. It is harmless to look at but breaks every text comparison: “CafÃ©” and “Café” are not equal, so data joins on company names quietly drop rows.

Fix: Specify UTF-8 explicitly at every read and write boundary. Treat the absence of an encoding declaration as a bug, not a default.
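
In Python, for example, that means passing the encoding explicitly at each boundary (file names illustrative):

```python
import pandas as pd

# Declare the encoding at every read and write; never rely on the platform default.
df = pd.read_csv("clients.csv", encoding="utf-8")             # fails loudly on a bad byte
df.to_csv("clients_clean.csv", encoding="utf-8", index=False)

with open("notes.txt", encoding="utf-8") as f:                # not a bare open("notes.txt")
    text = f.read()
```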

Whitespace, Casing & Invisible Characters

“ACME LTD ” with a trailing space is not the same value as “ACME LTD” without one. Data joins fail, deduplication misses obvious duplicates, and aggregations split across what should be a single client. Worse, control characters and zero-width spaces from copy-pasted spreadsheet data are invisible even when you look directly at them.

Fix: Trim whitespace and normalise casing on identifiers at ingestion. Strip non-printable characters from text fields. If two values look identical to the human eye, they should be identical to the database.
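
A sketch of an identifier normaliser in pandas, covering trailing whitespace, casing, zero-width characters, control characters, and non-breaking spaces:

```python
import re

import pandas as pd

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]|[\x00-\x1f\x7f]")

def normalise_identifier(s: pd.Series) -> pd.Series:
    """Make values that look identical to the eye identical to the database."""
    return (s.astype("string")
             .str.replace(ZERO_WIDTH, "", regex=True)    # delete zero-width and control chars
             .str.replace("\u00a0", " ")                 # non-breaking space -> ordinary space
             .str.replace(r"\s+", " ", regex=True)       # collapse runs of whitespace
             .str.strip()
             .str.upper())                               # normalise casing on identifiers

# df["client_name"] = normalise_identifier(df["client_name"])
```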

Inconsistent Null Markers

Within a single column, you may find genuine nulls alongside empty strings, the literal text “NULL”, “N/A”, “-”, “unknown”, and the number 0 used as a placeholder. Each of these means roughly the same thing to a human and something completely different to a query.

Fix: Pick one representation of “missing” – true SQL NULL is usually correct – and convert everything else to it at ingestion. Document what the legacy variants meant before you collapse them, in case the distinction was actually meaningful.
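
A minimal version in pandas – the marker list is illustrative, and 0 is deliberately excluded because whether zero means “missing” is a per-column decision:

```python
import pandas as pd

NULL_MARKERS = {"", "null", "n/a", "na", "-", "unknown"}   # document what each variant meant first

def collapse_nulls(s: pd.Series) -> pd.Series:
    """Convert every legacy 'missing' marker to a single true null."""
    lowered = s.astype("string").str.strip().str.lower()
    return s.mask(lowered.isin(NULL_MARKERS))             # masked values become genuine nulls
```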

👥 Duplicate & Near-Duplicate Rows

Exact duplicates are easy to spot. Near-duplicates are not – the same booking imported twice with one column trimmed differently, the same client recorded as both “Acme Ltd” and “Acme Limited”, the same invoice loaded once from a CSV and once from a system export. Each pair inflates revenue, distorts forecasts, and corrupts customer counts.

Fix: Define what makes a row unique (the natural key) before loading, not after. Run a duplicate-detection step against both exact and normalised values, and flag candidates for review rather than blindly merging them.
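
A sketch of the detection step in pandas, assuming a hypothetical natural key of client name plus invoice number; candidates are flagged for review, never auto-merged:

```python
import pandas as pd

def flag_duplicate_candidates(df: pd.DataFrame) -> pd.DataFrame:
    """Flag exact and near-duplicates on a normalised natural key."""
    key = (df["client_name"].str.strip().str.upper()
             .str.replace(r"\bLIMITED\b", "LTD", regex=True)   # 'Acme Limited' ~ 'Acme Ltd'
           + "|" + df["invoice_number"].astype("string").str.strip())
    return df.assign(dup_candidate=key.duplicated(keep=False))

# review = flag_duplicate_candidates(df).query("dup_candidate")
```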

🔗 Referential Integrity Gaps

Every foreign key value in your transactions table should point at a real record in the master table. In practice, you will find orphan client codes, cost centre references that don’t exist, and service codes that were retired six months ago and are still being used. Each orphan is a row that cannot be cleanly attributed in any report.

Fix: Run referential checks on every load. Quantify orphans by row count and by financial impact. Decide consciously whether to reject, quarantine, or auto-create the missing master record – but do not let orphans land silently.
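
An orphan check can be a single anti-join; the table and column names here are assumptions:

```python
import pandas as pd

def orphan_report(transactions: pd.DataFrame, clients: pd.DataFrame) -> pd.DataFrame:
    """Find transaction rows whose client_code has no master record, with impact attached."""
    orphans = transactions[~transactions["client_code"].isin(clients["client_code"])]
    print(f"{len(orphans):,} orphan rows, £{orphans['net_amount'].sum():,.0f} unattributable")
    return orphans   # reject, quarantine, or escalate -- but never load silently
```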

📊 Outliers That Aren’t Actually Impossible

A negative consignment weight, a delivery date before the booking date, a sale price five orders of magnitude larger than the typical row – these are not errors that crash the pipeline. They flow through, distort the averages, and quietly bias every report that touches them. Often they originate from a unit-of-measure switch (grams entered as kilograms) or a decimal-place typo.

Fix: Define plausible ranges per field as part of your data contract. Rows that fall outside should be flagged, reviewed, and corrected at source – not absorbed into the dataset.
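
A sketch of a range check driven by a data-contract dictionary; the fields and bounds are illustrative:

```python
import pandas as pd

# Plausible ranges per field, agreed as part of the data contract (values illustrative)
RANGES = {
    "weight_kg":  (0.01, 30_000),
    "sale_price": (0.0, 1_000_000),
}

def flag_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows outside agreed ranges for review at source -- don't absorb them."""
    bad = pd.Series(False, index=df.index)
    for col, (lo, hi) in RANGES.items():
        bad |= ~df[col].between(lo, hi)    # NaN fails between(), so missing values get flagged too
    return df.assign(out_of_range=bad)
```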

Part Three

A Checklist to Run Before Data Hits Production

Use this as a starting point. You will not need every check on every dataset, but if you cannot say with confidence that each box has been considered, you have an unknown risk in your pipeline.

Structural Checks
  • File arrived in the expected format and encoding (UTF-8 unless agreed otherwise)
  • Row count is within the expected range for this load – no empty load, no runaway load
  • All required columns are present and named as expected
  • Schema matches the previous successful load – new or removed columns are flagged for review
Field-Level Checks
  • Dates and times parse cleanly to a single canonical format and timezone
  • Numeric fields are numeric – no embedded currency symbols, commas, or stray text
  • Text fields are trimmed and casing is normalised on identifiers
  • Null markers are consistent – only one representation of “missing” exists in any column
  • Encoding artefacts have been detected and either fixed or rejected
Business-Rule Checks
  • Foreign keys resolve cleanly to a master record (clients, services, cost centres)
  • Values fall within plausible ranges – no negative weights, no impossible dates
  • Mandatory business fields are populated for the row’s status (e.g. completed jobs have a delivery date)
  • Currency, units of measure, and time zones are explicit and consistent
Reconciliation Checks
  • Row count matches the source system’s count for the same period
  • Key totals (revenue, volume, weight) reconcile to the source within an agreed tolerance
  • Distinct key values match between source and target
Operational Checks
  • Every modification a cleansing rule made is logged against the row it changed
  • Raw source data has been preserved untouched in the staging area
  • Failed rows are quarantined with a reason for human review – not silently dropped (a sketch follows this checklist)
  • The pipeline has alerted a named owner if any check failed
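
As flagged in the checklist above, a minimal sketch of the quarantine-and-alert pattern – the function names and CSV destination are illustrative, and the alert stub stands in for whatever channel the named owner actually watches:

```python
import logging
from pathlib import Path

import pandas as pd

log = logging.getLogger("nightly_load")

def quarantine(rows: pd.DataFrame, reason: str, outdir: Path = Path("quarantine")) -> None:
    """Persist failed rows with their reason, for human review -- never drop them silently."""
    outdir.mkdir(parents=True, exist_ok=True)
    rows.assign(quarantine_reason=reason).to_csv(outdir / "rejected.csv", index=False)
    log.warning("%d rows quarantined: %s", len(rows), reason)

def alert_owner(message: str) -> None:
    """Stand-in for the team's real alerting channel (email, Teams, pager)."""
    log.error("ALERT: %s", message)
```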

Want to Talk This Through?

Most teams know these principles are right but find that day-to-day delivery pressure makes implementing them systematically difficult. If you’d like to discuss how they apply to your pipeline – or where to start – get in touch or download your own copy of the checklist.
