This is where QA meets SQL. These are not textbook problems. These are real bugs that slip through every sprint. Orphaned records. Duplicates. Mismatched totals. Incomplete data. Your job is to find them before the customer does.
Our practice database has intentional data quality issues planted in it. Let us hunt them down.
Orphan records are rows that reference a parent that does not exist. In our schema, orders without customers, or order_items without orders. These break joins and cause blank rows in reports.
-- Check 1: Orders without a valid customer
SELECT o.order_id, o.customer_id, o.order_date
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Check 2: Order items without a valid order
SELECT oi.item_id, oi.order_id, oi.product_id
FROM order_items oi
LEFT JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_id IS NULL;
-- Check 3: Order items referencing non-existent products
SELECT oi.item_id, oi.order_id, oi.product_id
FROM order_items oi
LEFT JOIN products p ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
-- Check 4: Reviews by customers who do not exist
SELECT r.review_id, r.customer_id, r.product_id
FROM reviews r
LEFT JOIN customers c ON r.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Check 5: Payments for orders that do not exist
SELECT pay.payment_id, pay.order_id
FROM payments pay
LEFT JOIN orders o ON pay.order_id = o.order_id
WHERE o.order_id IS NULL;Duplicates are silent killers. A customer registered twice? They get two welcome emails. An order item inserted twice? Revenue doubles. Find them all.
-- Check 1: Duplicate customer emails
SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- Check 2: Duplicate order items (same order, same product)
SELECT order_id, product_id, COUNT(*) AS occurrences
FROM order_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
-- Check 3: Duplicate reviews (same customer, same product)
SELECT customer_id, product_id, COUNT(*) AS occurrences
FROM reviews
GROUP BY customer_id, product_id
HAVING COUNT(*) > 1;
-- Check 4: Multiple payments for the same order
SELECT order_id, COUNT(*) AS payment_count
FROM payments
GROUP BY order_id
HAVING COUNT(*) > 1;The order total should equal the sum of line items. The payment amount should equal the order total. When these do not match, money is being lost or created out of thin air.
-- Check 1: Order total vs sum of line items
SELECT o.order_id, o.total_amount AS order_total,
SUM(oi.quantity * oi.unit_price) AS calculated_total,
o.total_amount - SUM(oi.quantity * oi.unit_price) AS difference
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.total_amount
HAVING o.total_amount != SUM(oi.quantity * oi.unit_price);
-- Check 2: Payment amount vs order total
SELECT o.order_id, o.total_amount AS order_total,
p.amount AS payment_amount,
o.total_amount - p.amount AS difference
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE o.total_amount != p.amount;
-- Check 3: Unit price in order_items vs product price
-- (Catches silent price changes not reflected in orders)
SELECT oi.item_id, oi.order_id, p.product_name,
oi.unit_price AS charged_price,
p.price AS current_price,
oi.unit_price - p.price AS price_difference
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
WHERE oi.unit_price != p.price;Amount mismatch bugs are the scariest. They often pass testing because the UI shows the correct total — but the database stores a different number. Always validate at the database level.
Every order should have items. Every order should have a payment. Every completed order should have a successful payment. Check what is missing.
-- Check 1: Orders with no order items
SELECT o.order_id, o.order_date, o.total_amount, o.status
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.item_id IS NULL;
-- Check 2: Orders with no payment record
SELECT o.order_id, o.order_date, o.total_amount, o.status
FROM orders o
LEFT JOIN payments p ON o.order_id = p.order_id
WHERE p.payment_id IS NULL;
-- Check 3: Completed orders without successful payment
SELECT o.order_id, o.order_date, o.total_amount,
p.payment_status
FROM orders o
LEFT JOIN payments p ON o.order_id = p.order_id
WHERE o.status = 'completed'
AND (p.payment_status != 'success' OR p.payment_id IS NULL);
-- Check 4: Cancelled orders with successful payment (refund needed?)
SELECT o.order_id, o.order_date, o.total_amount,
p.payment_status, p.payment_method
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE o.status = 'cancelled'
AND p.payment_status = 'success';These checks verify that the data makes logical sense. Dates should be in order. Ratings should be within range. Customers should exist before they can order.
-- Check 1: Orders placed before customer signup
SELECT c.customer_id, c.first_name, c.signup_date,
o.order_id, o.order_date
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date < c.signup_date;
-- Check 2: Payment date before order date
SELECT o.order_id, o.order_date,
p.payment_date, p.payment_id
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE p.payment_date < o.order_date;
-- Check 3: Review date before order date for the same product
SELECT r.review_id, c.first_name,
p.product_name, r.review_date,
MIN(o.order_date) AS earliest_order
FROM reviews r
JOIN customers c ON r.customer_id = c.customer_id
JOIN products p ON r.product_id = p.product_id
JOIN order_items oi ON oi.product_id = r.product_id
JOIN orders o ON oi.order_id = o.order_id
AND o.customer_id = r.customer_id
GROUP BY r.review_id, c.first_name, p.product_name, r.review_date
HAVING r.review_date < MIN(o.order_date);
-- Check 4: Negative or zero stock for products that are being sold
SELECT p.product_id, p.product_name, p.stock_quantity,
COUNT(oi.item_id) AS times_ordered
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
WHERE p.stock_quantity <= 0
GROUP BY p.product_id, p.product_name, p.stock_quantity;Save these validation queries as a test suite. Run them after every deployment, every migration, every data import. Automate them with a cron job or CI pipeline. Prevention is better than a production incident.
Q: How do you validate data quality in a database?
A: I use a systematic approach with 5 categories of SQL checks. (1) Orphan checks — LEFT JOIN + IS NULL to find records referencing non-existent parents. (2) Duplicate checks — GROUP BY + HAVING COUNT > 1 to find duplicate rows. (3) Mismatch checks — comparing calculated values vs stored values (e.g., order total vs sum of line items). (4) Completeness checks — finding records that are missing required related data (orders without payments). (5) Business logic checks — verifying date sequences, value ranges, and logical constraints. I save these queries as reusable test suites and run them after every deployment.
Key Point: Five categories of data validation: orphan checks, duplicate detection, amount mismatches, completeness verification, and business logic validation. This is what QA engineers actually do with SQL every day.