Chapter 9: Performance Test Planning
Test data is the silent killer of performance tests. I have seen teams spend weeks scripting beautiful JMeter tests, only to run them against a database with 50 rows and conclude "the app is blazing fast." Of course it is -- try that same query against 10 million rows with a missing index and watch it crawl. Your test data strategy must cover three dimensions: volume (enough rows), variety (different data patterns), and velocity (data that changes during the test).
| Dimension | What It Means | Why It Matters | Example |
|---|---|---|---|
| Volume | Number of records matches production | Database query plans change with table size -- a full table scan on 100 rows is instant, on 10M rows it is catastrophic | 10M user accounts, 50M transactions, 500K products |
| Variety | Data covers edge cases and distributions | If all test accounts have 5 transactions, you miss the user with 500,000 transactions who triggers a slow query | Mix of new accounts (10 transactions) and power users (500K transactions) |
| Velocity | Data changes during the test | INSERT-heavy tests reveal lock contention, auto-increment bottlenecks, and replication lag that read-only tests miss | Fund transfers create new rows in transactions table during test execution |
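One way to see why volume and indexes matter is to ask the database how it plans to run a query before any load is applied. The sketch below is a minimal illustration, not a production check: it uses SQLite only because it ships with Python, and the `transactions` table, its columns, and the index name are hypothetical. The production engine's planner will behave differently, so the same check should be repeated there with its own `EXPLAIN`.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); only the detail text matters here.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL)"
)
# Hypothetical data: 50,000 rows spread over 1,000 accounts.
conn.executemany(
    "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
    ((i % 1000, i * 0.01) for i in range(50_000)),
)

lookup = "SELECT amount FROM transactions WHERE account_id = ?"

# Without an index on account_id the planner has to read the whole table.
plan_without_index = query_plan(conn, lookup, (42,))

conn.execute(
    "CREATE INDEX idx_transactions_account ON transactions (account_id)"
)

# With the index in place the planner can jump straight to the matching rows.
plan_with_index = query_plan(conn, lookup, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a table scan and the second reports a search using the index. On a 50-row test table both run instantly, which is why the difference only shows up at production volume.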
Export Production Schema -- Get the exact table structure, indexes, constraints, and partitioning from production. Never create tables by hand -- you will miss indexes or constraints that affect performance.
Anonymize Production Data (if allowed) -- The gold standard. Take a production backup, replace PII (names, emails, phone numbers, addresses) with fake data, and load it into the test environment. Tools: PostgreSQL anonymizer, MySQL Data Masking, custom scripts.
Generate Synthetic Data (if production data is not available) -- Use tools like Faker, Mockaroo, or custom scripts. The key is matching production distributions -- if 80% of users have < 100 transactions and 1% have > 100,000, your generated data must follow the same pattern.
Prepare User Credentials -- Create a CSV file with test usernames and passwords. Each virtual user needs its own credentials to avoid session collisions. For 3,000 concurrent users, prepare at least 5,000 credentials (buffer for retries).
Seed Reference Data -- Populate lookup tables, configuration data, and other reference data that the app needs. Missing reference data causes 500 errors that look like performance failures but are really data issues.
Plan Data Cleanup -- After each test run, you may need to clean up created data (new transactions, orders, etc.) to reset the state for the next run. Automate this with SQL scripts or API calls.
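The synthetic-data and credential steps above can be sketched with a short script. This is a minimal illustration using only the Python standard library, not a definitive generator: the tier boundaries follow the notes on the sample file that follows, while the 500,000 upper bound, the random seed, and the function names are assumptions made for the example.

```python
import csv
import io
import random

# (share of users, min transactions, max transactions)
# Shares follow the 80/15/5 split described in the sample file's notes.
TIERS = [
    (0.80, 1, 999),
    (0.15, 1_000, 49_999),
    (0.05, 50_000, 500_000),  # upper bound is an assumption
]


def generate_credentials(count, seed=42):
    """Build `count` unique test users with a skewed transaction distribution."""
    rng = random.Random(seed)  # fixed seed so every run yields the same file
    weights = [tier[0] for tier in TIERS]
    rows = []
    for n in range(1, count + 1):
        _, low, high = rng.choices(TIERS, weights=weights, k=1)[0]
        rows.append({
            "username": f"user_{n:04d}",
            "password": "Test@1234",
            "accountId": f"ACC{n:03d}",
            "accountType": rng.choice(["savings", "current"]),
            "transactionCount": rng.randint(low, high),
        })
    return rows


def write_csv(rows, stream):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


rows = generate_credentials(5000)
buffer = io.StringIO()
write_csv(rows, buffer)
print("\n".join(buffer.getvalue().splitlines()[:3]))
```

To produce a real file for the load tool, replace the `io.StringIO()` buffer with `open("users.csv", "w", newline="")`. The `transactionCount` column here is only a label; the matching number of rows still has to exist in the transactions table for the distribution to affect query performance.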
username,password,accountId,accountType,transactionCount
user_0001,Test@1234,ACC001,savings,150
user_0002,Test@1234,ACC002,current,45000
user_0003,Test@1234,ACC003,savings,12
user_0004,Test@1234,ACC004,current,890
user_0005,Test@1234,ACC005,savings,250000
...
user_5000,Test@1234,ACC5000,current,67
# Notes:
# - Each user has a unique accountId to avoid data collisions
# - transactionCount varies to simulate real distribution:
#     80% have < 1,000 transactions
#     15% have 1,000 - 50,000 transactions
#      5% have 50,000+ transactions (power users / business accounts)
# - accountType affects which queries run (savings vs current)
# - Password is the same for all test users (simplicity)

If you are using a copy of production data, anonymization is not optional. Regulations and standards such as GDPR, HIPAA, and PCI DSS restrict how real personal, health, and cardholder data may be used, and they apply to test environments too. Even in a "private" test environment, a leak of real customer names, emails, or financial data is a compliance violation.

-- Anonymize production data for performance testing
-- Run AFTER restoring production backup to test environment
-- Step 1: Replace personal information
UPDATE users SET
first_name = 'TestUser' || id,
last_name = 'Account' || id,
email = 'testuser' || id || '@loadtest.local',
phone = '555' || LPAD(id::text, 7, '0'),
date_of_birth = '1990-01-01'::date + (id % 10000 || ' days')::interval,
address_line1 = id || ' Test Street',
city = 'Test City',
postal_code = LPAD((id % 99999)::text, 5, '0');
-- Step 2: Reset all passwords to a known test password
UPDATE users SET
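password_hash = '$2b$10$abcdefghijklmnopqrstuuABCDEFGHIJKLMNOPQRSTUVWX';
-- The value above is a placeholder, not a valid bcrypt hash.
-- Replace it with a real bcrypt hash of "Test@1234" generated by the
-- application's own hashing library, or every test login will fail.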
-- Step 3: Anonymize financial data (preserve amounts and patterns)
UPDATE transactions SET
description = 'PERF_TEST_TXN_' || id,
merchant_name = 'Test Merchant ' || (id % 500);
-- Keep amounts, dates, and account relationships intact
-- These affect query performance and must be realistic
-- Step 4: Spot-check that no real email addresses remain
-- (this covers only users.email; review other columns separately)
SELECT COUNT(*) FROM users
WHERE email NOT LIKE '%@loadtest.local';
-- Result must be 0

Keep transaction amounts, dates, and relationships intact during anonymization. These affect query performance (JOINs, GROUP BYs, date range filters) and must reflect real patterns. Only replace identifying information -- names, emails, phone numbers, addresses, and national IDs.
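The verification query above checks only the `email` column, so real addresses can survive in free-text fields such as notes or descriptions. A broader spot-check could be sketched as follows. It is a minimal illustration under stated assumptions: rows are represented as plain dictionaries, the column names and sample values are hypothetical, and the regular expression catches common email shapes only. It does not detect names, phone numbers, or national IDs.

```python
import re

# Matches common email shapes; deliberately simple, not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ALLOWED_DOMAIN = "@loadtest.local"  # domain used by the anonymization script


def find_leftover_emails(rows):
    """Return (row_index, column, value) for emails outside the test domain."""
    findings = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # skip numbers, dates, NULLs
            for match in EMAIL_RE.findall(value):
                if not match.lower().endswith(ALLOWED_DOMAIN):
                    findings.append((index, column, match))
    return findings


# Hypothetical rows, as they might come back from a database cursor.
sample_rows = [
    {"email": "testuser1@loadtest.local", "notes": "PERF_TEST_TXN_1"},
    {"email": "testuser2@loadtest.local",
     "notes": "refund requested by jane.doe@example.com"},
]

leftovers = find_leftover_emails(sample_rows)
for row_index, column, value in leftovers:
    print(f"row {row_index}, column {column!r}: {value}")
```

An empty result from a scan like this is a necessary signal rather than proof: it shows that no email-shaped text escaped, not that every category of PII was replaced.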
Q: How do you handle test data for performance testing?
A: I follow a three-step approach. First, I ensure the data volume matches production -- if production has 10 million rows, my test database has 10 million rows. I have seen tests pass with 1,000 rows and fail spectacularly in production because query plans change completely with larger tables. Second, I ensure data variety -- not all test accounts should be identical. I create a mix of accounts: 80% with few transactions (under 1,000), 15% with moderate activity (1,000-50,000), and 5% power users with 50,000+ transactions. This catches slow queries that only trigger for heavy users. Third, I prepare dedicated user credentials in a CSV file -- each virtual user gets its own login to prevent session collisions. When the data comes from production, I anonymize it: I replace all PII but keep transactional patterns (amounts, dates, relationships) intact because they affect query performance.
Key Point: Test data must match production in volume and distribution -- anonymize PII but preserve transactional patterns, and always prepare dedicated credentials for each virtual user.