Data Accuracy — Source vs Target Validation

Accuracy means the data is not just present — it is correct. The name in the database matches what the user typed. The balance matches the calculated value. The migrated data matches the source. This is where you put on your auditor hat and compare numbers.

Source vs Target Comparison

After every data migration, ETL job, or system integration, you must compare source data with target data. Same records, same values, zero differences. Any mismatch is a bug.

source-target-comparison.sqlsql

-- Step 1: Count comparison
SELECT 'source' AS system, COUNT(*) FROM old_users
UNION ALL
SELECT 'target', COUNT(*) FROM users;

-- Step 2: Find records in source but NOT in target (lost records)
SELECT s.user_id, s.email, s.full_name
FROM old_users s
LEFT JOIN users t ON s.email = t.email
WHERE t.email IS NULL;

-- Step 3: Find records in target but NOT in source (extra records)
SELECT t.user_id, t.email, t.full_name
FROM users t
LEFT JOIN old_users s ON t.email = s.email
WHERE s.email IS NULL;

-- Step 4: Find records where values differ
SELECT s.email,
       s.full_name AS source_name, t.full_name AS target_name,
       s.phone AS source_phone, t.phone AS target_phone
FROM old_users s
JOIN users t ON s.email = t.email
WHERE s.full_name != t.full_name
   OR s.phone != t.phone;

Expected vs Actual Values

expected-vs-actual.sqlsql

-- Balance accuracy: calculated balance vs stored balance
SELECT
  a.account_id,
  a.balance AS stored_balance,
  COALESCE(SUM(CASE
    WHEN t.to_account = a.account_id THEN t.amount
    WHEN t.from_account = a.account_id THEN -t.amount
  END), 0) AS calculated_balance,
  a.balance - COALESCE(SUM(CASE
    WHEN t.to_account = a.account_id THEN t.amount
    WHEN t.from_account = a.account_id THEN -t.amount
  END), 0) AS difference
FROM accounts a
LEFT JOIN transactions t
  ON a.account_id IN (t.from_account, t.to_account)
GROUP BY a.account_id, a.balance
HAVING a.balance != COALESCE(SUM(CASE
    WHEN t.to_account = a.account_id THEN t.amount
    WHEN t.from_account = a.account_id THEN -t.amount
  END), 0);

Handling Decimal Tolerance

When dealing with decimal numbers, floating-point math can introduce tiny rounding errors. 10.00 / 3 might produce 3.333333...34 instead of 3.33. You cannot use strict equality. Instead, compare within a tolerance.

decimal-tolerance.sqlsql

-- WRONG: Strict equality fails with floating-point
SELECT * FROM invoices
WHERE total_amount != calculated_total;  -- Catches rounding noise

-- RIGHT: Compare within tolerance (0.01 = 1 paisa)
SELECT invoice_id, total_amount, calculated_total,
       ABS(total_amount - calculated_total) AS difference
FROM invoices
WHERE ABS(total_amount - calculated_total) > 0.01;
-- Only flags differences greater than 1 paisa

-- For percentage-based tolerance (useful for large amounts)
SELECT invoice_id, total_amount, calculated_total
FROM invoices
WHERE ABS(total_amount - calculated_total) / total_amount > 0.001;
-- Flags differences greater than 0.1%

Never store monetary values as FLOAT or DOUBLE. Always use DECIMAL(precision, scale). FLOAT introduces rounding errors that accumulate over thousands of transactions. Rs 0.01 error per transaction times 100,000 transactions per day = Rs 1,000 per day. I have seen this in production.

Q: How do you validate data accuracy after a data migration?

A: I follow a four-step process. First, count comparison — source and target should have the same number of records. Second, I find records missing from the target using LEFT JOIN. Third, I find extra records in the target that do not exist in the source. Fourth, I compare field values for matching records to catch any data transformation errors — truncation, encoding changes, rounding issues. For decimal fields, I use a tolerance-based comparison (ABS difference < 0.01) to handle floating-point precision. I document each discrepancy with the exact SQL query and expected vs actual values.

Key Point: Data accuracy means every value matches its expected source. Compare source vs target after migrations, and use tolerance for decimal comparisons.

Previous Up NextBusiness Rule Validation

Chapter 5: Data Validation Techniques