Data Transformation Testing

Migration is not always a simple copy. Data gets transformed along the way. Columns split. Columns merge. Data types change. Default values fill in gaps. Encoding converts. Each transformation is a potential point of failure. Each one needs its own test.

Common Data Transformations

Transformation	Source	Target	Risk
Column split	name = "Rajesh Verma"	first_name = "Rajesh", last_name = "Verma"	Multi-word names, single names, titles (Dr., Mr.)
Column merge	street + city + state + pin	full_address = "MG Road, Pune, MH 411001"	Missing components, delimiter handling
Type conversion	VARCHAR "2024-01-15"	DATE 2024-01-15	Invalid dates, format ambiguity (01/02/2024 — Jan 2 or Feb 1?)
Default fill	NULL status	status = "ACTIVE"	NULL might mean something different than ACTIVE
Encoding change	Latin1 "caf\u00e9"	UTF-8 "caf\u00e9"	Hindi/Marathi text corruption, emoji loss
Precision change	DECIMAL(10,4) = 99.9999	DECIMAL(10,2) = 100.00	Rounding errors in financial data

Testing Column Splits

test-column-split.sqlsql

-- Source: customers.name (full name in one column)
-- Target: customers.first_name, customers.last_name

-- Test 1: Normal two-word names
SELECT s.customer_id, s.name,
       t.first_name, t.last_name,
       CONCAT(t.first_name, ' ', t.last_name) AS reconstructed
FROM source_db.customers s
JOIN target_db.customers t ON s.customer_id = t.customer_id
WHERE s.name != CONCAT(t.first_name, ' ', t.last_name)
LIMIT 20;

-- Test 2: Names with more than two words
-- "Aamir Khan Shah" — where does last name start?
SELECT s.customer_id, s.name,
       t.first_name, t.last_name
FROM source_db.customers s
JOIN target_db.customers t ON s.customer_id = t.customer_id
WHERE LENGTH(s.name) - LENGTH(REPLACE(s.name, ' ', '')) > 1;

-- Test 3: Single-word names (no space)
SELECT s.customer_id, s.name,
       t.first_name, t.last_name
FROM source_db.customers s
JOIN target_db.customers t ON s.customer_id = t.customer_id
WHERE s.name NOT LIKE '% %';

Testing Type Conversions

test-type-conversion.sqlsql

-- Source: date_of_birth stored as VARCHAR "15-01-1990"
-- Target: date_of_birth as DATE type

-- Test 1: Check for invalid dates that could not convert
SELECT customer_id, date_of_birth
FROM target_db.customers
WHERE date_of_birth IS NULL;

-- Cross-reference with source — were these NULL originally?
SELECT s.customer_id,
       s.date_of_birth AS source_dob_varchar,
       t.date_of_birth AS target_dob_date
FROM source_db.customers s
JOIN target_db.customers t ON s.customer_id = t.customer_id
WHERE s.date_of_birth IS NOT NULL
  AND t.date_of_birth IS NULL;
-- Any rows here = conversion failures (data loss)

-- Test 2: Verify dates match
SELECT s.customer_id,
       s.date_of_birth AS source_dob,
       DATE_FORMAT(t.date_of_birth, '%d-%m-%Y') AS target_dob_formatted
FROM source_db.customers s
JOIN target_db.customers t ON s.customer_id = t.customer_id
WHERE s.date_of_birth != DATE_FORMAT(t.date_of_birth, '%d-%m-%Y')
LIMIT 20;

Testing Default Value Fills

test-default-values.sqlsql

-- Migration rule: NULL status → "ACTIVE"
-- But wait — were all NULLs supposed to become ACTIVE?

-- Count NULLs in source
SELECT COUNT(*) AS null_status_in_source
FROM source_db.accounts
WHERE status IS NULL;

-- Count new ACTIVE in target that were NULL in source
SELECT COUNT(*) AS null_to_active
FROM source_db.accounts s
JOIN target_db.accounts t ON s.account_id = t.account_id
WHERE s.status IS NULL AND t.status = 'ACTIVE';

-- These two numbers should match
-- If null_to_active < null_status_in_source, some NULLs got a different value

Testing Precision Changes

test-precision.sqlsql

-- Source: balance DECIMAL(12,4) — four decimal places
-- Target: balance DECIMAL(12,2) — two decimal places

-- Find records where rounding changed the value
SELECT s.account_id,
       s.balance AS source_balance,
       t.balance AS target_balance,
       s.balance - t.balance AS rounding_diff
FROM source_db.accounts s
JOIN target_db.accounts t ON s.account_id = t.account_id
WHERE ROUND(s.balance, 2) != t.balance;

-- Total financial impact of rounding
SELECT
  SUM(s.balance) AS source_total,
  SUM(t.balance) AS target_total,
  SUM(s.balance) - SUM(t.balance) AS total_rounding_loss
FROM source_db.accounts s
JOIN target_db.accounts t ON s.account_id = t.account_id;

In financial systems, even a Rs 0.01 rounding difference per record can add up to lakhs across millions of records. Always calculate the total financial impact of precision changes. Auditors will ask for this number.

Testing Encoding Changes

test-encoding.sqlsql

-- Check for encoding corruption (garbled text)
SELECT customer_id, name
FROM target_db.customers
WHERE name LIKE '%?%'
   OR name LIKE '%\\ufffd%'
   OR LENGTH(name) != CHAR_LENGTH(name) * 1;  -- Multi-byte mismatch

-- Compare character lengths
SELECT s.customer_id,
       s.name AS source_name,
       t.first_name AS target_name,
       LENGTH(s.name) AS source_bytes,
       LENGTH(t.first_name) AS target_bytes
FROM source_db.customers s
JOIN target_db.customers t ON s.customer_id = t.customer_id
WHERE LENGTH(s.name) != LENGTH(t.first_name) + LENGTH(t.last_name) + 1;

Q: What data transformations do you test during migration and how?

A: I test six common transformations: (1) Column splits — verify multi-word names, single names, and names with titles are handled correctly by reconstructing the original and comparing. (2) Type conversions — check for conversion failures (source NOT NULL but target IS NULL), and verify converted values match the original format. (3) Default value fills — count source NULLs and verify they all received the expected default, not a different value. (4) Precision changes — find records where rounding altered values and calculate total financial impact. (5) Encoding changes — check for garbled text, question marks, and byte-length mismatches that indicate encoding corruption. (6) Column merges — verify all components are present and delimiters are correct. Each transformation gets its own SQL validation query.

Key Point: Test every transformation independently: column splits, type conversions, default fills, precision changes, and encoding. Each one can silently corrupt data in ways that record counts will never catch.

Previous Up NextReferential Integrity After Migration

Chapter 7: Data Migration Testing