Duplicate Detection

Duplicates are the cockroaches of database bugs. They sneak in silently and multiply. A user clicks Submit twice. A batch job runs twice. A migration inserts the same records again. Before you know it, the same customer has two accounts, the same order is billed twice, and the same employee gets two salary credits.

Exact Duplicates — GROUP BY HAVING

The bread and butter of duplicate detection. GROUP BY the column that should be unique, then HAVING COUNT > 1 to find violations.

find-exact-duplicates.sqlsql

-- Duplicate emails in users table
SELECT email, COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Duplicate orders (same user, same amount, same date)
SELECT user_id, total_amount, DATE(created_at) AS order_date,
       COUNT(*) AS duplicate_count
FROM orders
GROUP BY user_id, total_amount, DATE(created_at)
HAVING COUNT(*) > 1;

-- See the actual duplicate rows (not just counts)
SELECT u.*
FROM users u
INNER JOIN (
  SELECT email
  FROM users
  GROUP BY email
  HAVING COUNT(*) > 1
) dup ON u.email = dup.email
ORDER BY u.email, u.created_at;

Fuzzy Duplicates — Near Matches

Not all duplicates are exact. "Ravi Sharma" and "ravi sharma" are the same person. "Ravi Sharma" and "Ravi Sharma" (double space) are the same person. "9876543210" and "+919876543210" are the same phone number. Fuzzy duplicate detection catches these.

find-fuzzy-duplicates.sqlsql

-- Case-insensitive duplicate emails
SELECT LOWER(email) AS normalized_email, COUNT(*) AS occurrences
FROM users
GROUP BY LOWER(email)
HAVING COUNT(*) > 1;

-- Duplicate names after trimming whitespace
SELECT TRIM(LOWER(full_name)) AS normalized_name, COUNT(*)
FROM users
GROUP BY TRIM(LOWER(full_name))
HAVING COUNT(*) > 1;

-- Duplicate phone numbers (strip country code and spaces)
SELECT
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS digits_only,
  COUNT(*)
FROM users
GROUP BY REGEXP_REPLACE(phone, '[^0-9]', '', 'g')
HAVING COUNT(*) > 1;

Exact Duplicates

●Same value in the column, byte-for-byte
●Easy to detect with GROUP BY HAVING
●Prevented by UNIQUE constraint
●Example: same email appears twice
●Fix: DELETE one, add UNIQUE constraint

Fuzzy Duplicates

●Same meaning, different representation
●Need LOWER(), TRIM(), REGEXP to normalize
●UNIQUE constraint alone cannot prevent
●Example: "Ravi" vs "ravi" vs " Ravi "
●Fix: Normalize data on INSERT (trigger or app code)

In financial systems, duplicate transactions are P1 bugs. If a salary credit runs twice, employees get paid double. If a debit runs twice, customers lose money. Always check for duplicates after any batch job or scheduled task.

Q: How do you detect and prevent duplicate records in a database?

A: To detect duplicates, I use GROUP BY with HAVING COUNT(*) > 1 on columns that should be unique. For fuzzy duplicates (case differences, extra spaces), I normalize the data using LOWER() and TRIM() before grouping. To prevent duplicates, I verify that UNIQUE constraints or unique indexes exist on the appropriate columns. For multi-column uniqueness (like one review per user per product), I check for composite unique constraints. I also test double-submit scenarios by clicking Submit rapidly to see if the application handles idempotency.

Key Point: Detect exact duplicates with GROUP BY HAVING. Detect fuzzy duplicates by normalizing with LOWER(), TRIM(), and regex before grouping.

Previous Up NextData Completeness — Finding Missing Data

Chapter 5: Data Validation Techniques