1. Export Production Schema -- Get the exact table structure, indexes, constraints, and partitioning from production. Never create tables by hand -- you are likely to miss indexes or constraints that affect performance.

2. Anonymize Production Data (if allowed) -- The gold standard. Take a production backup, replace PII (names, emails, phone numbers, addresses) with fake data, and load the result into the test environment. Tools: PostgreSQL Anonymizer, MySQL Enterprise Data Masking and De-Identification, custom scripts.

3. Generate Synthetic Data (if production data is not available) -- Use tools like Faker, Mockaroo, or custom scripts. The key is matching production distributions -- if 80% of users have fewer than 100 transactions and 1% have more than 100,000, your generated data must follow the same pattern.

4. Prepare User Credentials -- Create a CSV file with test usernames and passwords. Each virtual user needs its own credentials to avoid session collisions. For 3,000 concurrent users, prepare at least 5,000 credentials (the extra 2,000 are a buffer for retries).

5. Seed Reference Data -- Populate lookup tables, configuration data, and any other reference data the app needs. Missing reference data causes HTTP 500 errors that look like performance failures but are really data issues.

6. Plan Data Cleanup -- After each test run, you may need to remove the data the test created (new transactions, orders, etc.) to reset the state for the next run. Automate this with SQL scripts or API calls.
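The credential file from the "Prepare User Credentials" step can be produced with a short script. The following is a minimal sketch, not a definitive implementation: the `loadtest_user_NNNNN` naming scheme, the `username,password` column layout, and the function names are assumptions made for illustration, and the same accounts would still have to be created in the system under test.

```python
import csv
import secrets


def credential_rows(concurrent_users, spare):
    """Yield (username, password) pairs: one per virtual user plus spares."""
    total = concurrent_users + spare
    for i in range(1, total + 1):
        # Zero-padded sequential names guarantee uniqueness, so no two
        # virtual users ever log in with the same account.
        username = f"loadtest_user_{i:05d}"  # hypothetical naming scheme
        password = secrets.token_urlsafe(12)  # random, URL-safe password
        yield username, password


def write_credentials(fh, concurrent_users, spare):
    """Write a header plus one credential per row to an open text file.

    Returns the number of credential rows written.
    """
    writer = csv.writer(fh)
    writer.writerow(["username", "password"])  # assumed column layout
    written = 0
    for row in credential_rows(concurrent_users, spare):
        writer.writerow(row)
        written += 1
    return written
```

For the 3,000-user example, calling `write_credentials(fh, 3000, 2000)` on a file opened with `open("credentials.csv", "w", newline="")` would write 5,000 rows. Most load-testing tools can then feed one row to each virtual user.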

Dimension	What It Means	Why It Matters	Example
Volume	Record counts match production	Database query plans change with table size -- a full table scan on 100 rows is instant, but on 10M rows it is catastrophic	10M user accounts, 50M transactions, 500K products
Variety	Data covers edge cases and realistic distributions	If every test account has 5 transactions, you miss the user with 500,000 transactions who triggers a slow query	Mix of new accounts (10 transactions) and power users (500K transactions)
Velocity	Data changes during the test	INSERT-heavy tests reveal lock contention, auto-increment bottlenecks, and replication lag that read-only tests miss	Fund transfers create new rows in the transactions table during test execution
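To make the Variety row concrete, here is a minimal sketch of generating per-user transaction counts that follow a skewed distribution. The 80% and 1% tiers reuse the example figures from the text. The 19% middle tier, the exact bounds of each tier, and the function names are assumptions for illustration. Real tiers should come from a histogram of production data.

```python
import random

# (share of users, min transactions, max transactions)
# Only the 80% "< 100" and 1% "> 100,000" figures come from the text;
# the middle tier and the 500,000 upper bound are assumed for this sketch.
TIERS = [
    (0.80, 0, 99),
    (0.19, 100, 100_000),
    (0.01, 100_001, 500_000),
]


def sample_transaction_counts(n_users, seed=42):
    """Return one transaction count per user, drawn tier by tier."""
    rng = random.Random(seed)  # fixed seed keeps test datasets reproducible
    weights = [share for share, _, _ in TIERS]
    chosen = rng.choices(TIERS, weights=weights, k=n_users)
    return [rng.randint(low, high) for _, low, high in chosen]


def tier_shares(counts):
    """Report the fraction of users in the light and heavy tiers."""
    n = len(counts)
    light = sum(1 for c in counts if c < 100) / n
    heavy = sum(1 for c in counts if c > 100_000) / n
    return light, heavy
```

Running `tier_shares(sample_transaction_counts(100_000))` is a quick sanity check that the generated dataset is close to the target shares before it is loaded into the test database.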

Test Data Preparation -- Realistic Volume and Anonymization

The Three Vs of Test Data

Data Generation Strategies

Building Your Test Dataset

User Credential Files

Data Anonymization -- The Non-Negotiable