Part 3: Building a Production-Grade Traffic Capture and Replay System


At a previous company, we had over 100 microservices. I’d make what seemed like a simple change to one service and deploy it, only to discover it broke something completely unrelated. A change to the user service would break checkout. An update to notifications would break reporting. We spent more time fixing unexpected bugs than shipping features.

The problem was that our test scenarios were too simple. Production traffic had edge cases, timing dependencies, and integration patterns we never thought to test. That’s what drew me to Speedscale: the chance to build a traffic replay platform that catches these issues before release by using actual production patterns.

This is the third post in a 3-part series:

  1. How to build a traffic capture system
  2. How to build a traffic transform system
  3. How to build a traffic replay system (you’re here)

You’ve captured production traffic (Part 1) and identified what needs to transform (Part 2). Now replay it against your new code to validate changes. This is fundamentally a big data problem: you have tons of traffic, storage is cheap, so use it.

The Core Problem

Replaying traffic means taking captured request/response pairs and sending them to your service under test. Sounds simple, but three problems emerge immediately:

State mismatch: Captured traffic references data that doesn’t exist in your test environment. A request updating order_123 fails because that order only exists in production. These are exactly the real-world conditions that hand-written test scenarios never cover.

Time staleness: Timestamps, JWT tokens, and date ranges become invalid the moment you capture them. A token captured today will no longer be valid tomorrow.

Non-determinism: Responses contain generated UUIDs, timestamps, and random values. Byte-for-byte comparison fails on virtually every response.

The solution is selective mocking with transformed traffic. Replay inbound requests to your service with a live database, but mock downstream dependencies so the test stays focused on the code you actually changed rather than everything around it. Transform time-dependent values before replay.

Replay Strategies

Direct Replay Load Testing

Send captured requests directly to your service in a test environment. Typically this means generating pcap files with tools like tcpdump or Wireshark, then replaying the traffic with a load testing tool like k6 or JMeter to simulate real-world scenarios. It works for stateless, read-only endpoints with minimal dependencies.

Direct Replay

Limitations: Requires identical environment setup, breaks when dependencies are unavailable, and gives you no control over side effects. Unfortunately, the real world is typically more complex, so this approach falls down after “hello world”.

Shadow Production Traffic

Duplicate live production traffic in real-time to a new version of your service. Original requests go to production; copies go to your new version with responses discarded.

Shadow Traffic

Pros: Immediate validation with real traffic patterns.

Cons: If your service writes to a database, you need to handle migrations and data cleanup. Bugs write bad data that must be backed out. It is very tricky to validate responses or test failure scenarios in isolation, and you need an easy way to route traffic away from the new version the moment something goes wrong.

Use for: Read-heavy services, performance validation, when you need immediate production-scale testing.
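
In practice the duplication usually happens at the proxy or service-mesh layer, but the mechanics are easy to see in a small tee. Here is a minimal sketch using Node’s built-in http module; the hostnames and ports are placeholders, not anything prescribed in this series:

// Minimal shadow-traffic tee (sketch). Hostnames/ports are placeholders.
const http = require('http');

const PROD = { host: 'prod-service', port: 8080 };
const SHADOW = { host: 'candidate-service', port: 8080 };

http.createServer((clientReq, clientRes) => {
    const opts = { method: clientReq.method, path: clientReq.url, headers: clientReq.headers };

    // Primary copy: forward to production and return its response to the caller
    const prodReq = http.request({ ...PROD, ...opts }, (prodRes) => {
        clientRes.writeHead(prodRes.statusCode, prodRes.headers);
        prodRes.pipe(clientRes);
    });

    // Shadow copy: forward to the new version and discard the response
    const shadowReq = http.request({ ...SHADOW, ...opts }, (shadowRes) => shadowRes.resume());
    shadowReq.on('error', () => {}); // a failing shadow must never affect production

    clientReq.pipe(prodReq);
    clientReq.pipe(shadowReq);
}).listen(8000);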

Replay Traffic with Service Mocks

Replay captured inbound traffic to your service while mocking downstream dependencies. Your service and database run live in a test environment; external APIs and other services are mocked. This isolates your code from everything outside your control while still exercising it end to end.

Traffic Replay with Mocks

This is the sweet spot. Test failures are good because they mean you found bugs before production. You can run tests repeatedly, validate responses in detail, and there is no risk of corrupting production data. Mix production traffic with locally captured traffic from proxymock to cover both production patterns and new code paths.

Setup:

replay:
  target: http://service:8080
  traffic:
    - ./prod-traffic/         # Production patterns
    - ./local-traffic/        # New endpoints from proxymock
  mocks:
    - service: postgres-mock
      port: 5432
      traffic: ./prod-traffic/postgres/
    - service: payment-api
      port: 8081
      traffic: ./prod-traffic/payment-api/

Building the Traffic Replay Tool

Architecture

Four components work together:

  1. Traffic Store: S3, a filesystem, or a database holding transformed request/response pairs. Transformation happened in Part 2: timestamps replaced with {{now}}, IDs mapped, tokens identified. The store organizes traffic by timestamp, endpoint, and session.

  2. Replay Orchestrator: Decides which traffic to replay, in what order, and at what speed. Handles timing between requests and parallelization, and injects runtime variables like {{now}} right before sending. Preserving the captured timing and ordering is what keeps the replay close to real user behavior.

  3. Response Validator: Compares actual vs expected responses with configurable fuzzy matching.

  4. Results Reporter: Aggregates pass/fail results, diffs, and timing data.

Key insight: Transformation is done once in advance (Part 2) for performance. At replay time, we only inject runtime variables:

function injectRuntimeVars(request) {
    // Simple string replacement for runtime values
    return request
        .replace(/{{now}}/g, new Date().toISOString())
        .replace(/{{uuid}}/g, crypto.randomUUID());
}

This is much faster than applying full transformation rules on every replay.
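
For example, a body transformed in Part 2 might sit in the traffic store with placeholders and be expanded just before the request goes out (the field names and values here are purely illustrative):

// Hypothetical transformed body from the traffic store (Part 2)
const stored = '{"createdAt": "{{now}}", "idempotencyKey": "{{uuid}}"}';

// Expanded immediately before the request is sent
const body = injectRuntimeVars(stored);
// e.g. {"createdAt": "2025-06-01T12:34:56.000Z", "idempotencyKey": "0b7e…"}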

Timing and Ordering

Preserve the temporal relationships between requests in a session; the closer the replayed timing and ordering are to the original capture, the closer your test is to real-world conditions. Here’s an example using Grafana k6:

import http from 'k6/http';
import { sleep, check } from 'k6';

// injectRuntimeVars() is the helper shown in the Architecture section above
const requests = JSON.parse(open('./captured-traffic.json'));

export default function() {
    let baseline = requests[0].timestamp;

    requests.forEach(req => {
        sleep(Math.min((req.timestamp - baseline) / 1000, 5.0));

        const body = injectRuntimeVars(req.body);
        const response = http.post(req.url, body);

        check(response, {
            'status matches': (r) => r.status === req.expectedStatus
        });

        baseline = req.timestamp;
    });
}

Key decisions: Replay sessions sequentially to preserve state dependencies. Run multiple sessions in parallel with k6 VUs for isolated state. Cap delays to speed up tests.
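
Here is a sketch of that parallel-session setup. It assumes a hypothetical captured-sessions.json file with one entry per captured session; each k6 VU replays exactly one session, so state never crosses sessions:

import http from 'k6/http';
import { sleep, check } from 'k6';
import exec from 'k6/execution';

const sessions = JSON.parse(open('./captured-sessions.json'));

export const options = {
    scenarios: {
        sessions: {
            executor: 'per-vu-iterations',
            vus: 10,         // 10 sessions replayed in parallel
            iterations: 1,   // each VU handles exactly one session
        },
    },
};

export default function () {
    // exec.vu.idInTest is 1-based, so VU 1 gets sessions[0], and so on
    const session = sessions[(exec.vu.idInTest - 1) % sessions.length];
    let baseline = session.requests[0].timestamp;

    session.requests.forEach((req) => {
        sleep(Math.min((req.timestamp - baseline) / 1000, 5.0)); // capped delay
        const res = http.post(req.url, req.body);
        check(res, { 'status matches': (r) => r.status === req.expectedStatus });
        baseline = req.timestamp;
    });
}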

Mock Setup

Use captured outbound traffic to configure mocks. When your service calls a dependency, the mock returns the captured production response. Most teams use WireMock or MockServer for this:

// Configure WireMock stubs from captured traffic
import http from 'k6/http';

const outboundTraffic = JSON.parse(open('./outbound-traffic.json'));

export function setup() {
    // k6 disallows HTTP calls in the init context, so register stubs in setup()
    outboundTraffic.forEach(({ request, response }) => {
        http.post('http://wiremock:8080/__admin/mappings', JSON.stringify({
            request: {
                method: request.method,
                urlPath: request.path
            },
            response: {
                status: response.status,
                body: response.body,
                headers: response.headers
            }
        }), {
            headers: { 'Content-Type': 'application/json' }
        });
    });
}
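
After a run, WireMock’s admin API can report the calls that no stub matched; persisting those is an easy way to grow your stub library and to feed the mock match-rate gate described later. A small sketch (the WireMock host is a placeholder):

import http from 'k6/http';

export function teardown() {
    // WireMock admin endpoint listing requests that matched no stub
    const res = http.get('http://wiremock:8080/__admin/requests/unmatched');
    const unmatched = JSON.parse(res.body).requests || [];
    console.log(`unmatched mock requests: ${unmatched.length}`);
    unmatched.forEach((r) => console.log(`  ${r.method} ${r.url}`));
}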

Response Validation

k6 provides built-in validation with check(). Focus on status codes and key business fields:

Tip: The example below checks the status code plus a couple of business-critical fields; extend it to cover whatever response data matters for your service. Replaying real user sessions this way also surfaces behavior and performance problems that hand-written scenarios tend to miss.

import { check } from 'k6';

// Validate response
check(response, {
    'status is 200': (r) => r.status === 200,
    'has orderId': (r) => JSON.parse(r.body).orderId !== undefined,
    'total is correct': (r) => JSON.parse(r.body).total === expectedTotal
});

For more complex validation, k6’s chai assertions provide expect-style matching.
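
If you would rather roll your own comparison, a small recursive helper that skips volatile fields keeps generated values from failing every check. This is a sketch; the ignored field names are placeholders for whatever your responses actually contain:

// Compare JSON bodies while ignoring fields expected to differ between runs.
// The field names below are illustrative placeholders.
const IGNORED_FIELDS = ['createdAt', 'updatedAt', 'requestId', 'traceId'];

function fuzzyMatch(expected, actual, path = '') {
    if (IGNORED_FIELDS.includes(path.split('.').pop())) return [];
    if (typeof expected !== 'object' || expected === null) {
        return expected === actual ? [] : [`${path}: expected ${expected}, got ${actual}`];
    }
    let diffs = [];
    for (const key of Object.keys(expected)) {
        const childPath = path ? `${path}.${key}` : key;
        diffs = diffs.concat(fuzzyMatch(expected[key], (actual || {})[key], childPath));
    }
    return diffs;
}

// Usage inside a check:
// const diffs = fuzzyMatch(JSON.parse(req.expectedBody), JSON.parse(response.body));
// check(response, { 'body matches': () => diffs.length === 0 });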

Replay Orchestration (Phased Environment Bring-up)

Before you press “play” on traffic, bring the environment up in phases and promote only when stable. Plan bottom-up:

Phase 1: Backends (foundation)

  • Mock servers: Replace external HTTP dependencies with mocks (WireMock/MockServer). Use captured outbound traffic to build stubs. Gate: mock match rate ≥ 98%, unmatched calls = 0, p95 dependency latency within historical range.
  • Mock databases: Use protocol-aware fakes or record-and-replay stubs only when true DB state is hard to seed. Gate: query match coverage ≥ 95%, no unhandled SQL.
  • Real servers: For critical cross-service contracts, point to real services in an isolated namespace. Gate: no cross-namespace egress, auth works, rate limits disabled or raised.
  • Real databases: Prefer a real DB for the SUT with isolated data. Gate: schema at target migration, read/write paths validated, rollback/restore tested.

Example (conceptual) orchestration inputs:

orchestration:
  phase: backends
  mocks:
    - name: payment-api
      type: http
      image: wiremock/wiremock:3
      stubs: ./prod-traffic/payment-api/
      gates:
        match_rate: ">= 0.98"
        unmatched_requests: "== 0"
    - name: postgres-mock
      type: db
      traffic: ./prod-traffic/postgres/
      gates:
        query_match_rate: ">= 0.95"
  real:
    - name: catalog-svc
      namespace: replay
      gates:
        readiness: "true"
        latency_p95_delta: "<= +10%"

Promote to the next phase only when all backend gates are green.

Phase 2: System Under Test (applications)

Deploy the service(s) you are testing with production-like config but isolated data.

  • Readiness: health checks green for N minutes, migrations complete, caches warm.
  • Stability: logs free of new errors, CPU/mem steady, connection pools healthy.
  • Isolation: SUT only talks to mocks/approved real services.

orchestration:
  phase: sut
  apps:
    - name: orders-svc
      image: ghcr.io/acme/orders:pr-123
      env:
        DATABASE_URL: postgres://...
        PAYMENT_API_URL: http://payment-api.mock:8080
      gates:
        ready_seconds: ">= 120"
        error_rate: "== 0"
        cpu_delta: "<= +15%"
        mem_delta: "<= +15%"

Phase 3: Quality Gates

Define objective gates so promotion is automatic:

  • Status codes: distribution matches baseline; 5xx = 0; pass rate ≥ 99%.
  • Body comparisons: JSON-path scoped compare with tolerances and ignored fields (timestamps, UUIDs). Field-level match rate ≥ 95%.
  • Performance: p95 latency delta ≤ +10%; throughput within ±10% of baseline; error budget unchanged.
  • Resources: CPU and memory deltas ≤ +15%; GC pauses not worse than baseline; DB connection pool saturation unchanged.
  • Mocks: match rate ≥ 98%; unmatched = 0; new unmatched samples are captured as candidates for new stubs.
  • Side effects: write counts, emitted events, and external calls match baseline within agreed bounds.

Example gate configuration:

quality_gates:
  status_pass_rate: ">= 0.99"
  body_match_rate: ">= 0.95"
  mock_match_rate: ">= 0.98"
  latency_p95_delta: "<= +10%"
  cpu_delta: "<= +15%"
  mem_delta: "<= +15%"

These gates govern promotion to Phase 4.
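
Several of these gates map directly onto k6 thresholds, which fail (or abort) the run on their own; deltas against a baseline still need the Results Reporter. A rough sketch, reusing the replay_failed counter from the result aggregation script later in this post:

export const options = {
    thresholds: {
        // Correctness gates
        checks: ['rate>=0.99'],                                        // check pass rate
        replay_failed: [{ threshold: 'count==0', abortOnFail: true }], // custom counter
        // Transport and latency gates (absolute budgets; baseline deltas
        // are computed by the Results Reporter instead)
        http_req_failed: ['rate==0'],
        http_req_duration: ['p(95)<500'],
    },
};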

Telemetry and Observability Mapping

Instrument every layer and tie signals back to the replay run ID:

  • Mocks: request count, match vs. miss rate, per-endpoint latency, fallback usage; persist unmatched examples for new stubs.
  • SUT: logs, traces, CPU, memory, GC, error rate, p50/p95/p99 latency, open connections, threadpool saturation; DB read/write counts.
  • Replay controller: sent/received, pass/fail counts, backpressure, queue depths.
  • Baseline: previous run or golden snapshot for deltas (latency, resource, correctness).

Tip: export all metrics with labels like run_id, service, endpoint, phase so reports can slice by phase and endpoint.
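
In k6 this is mostly a matter of tagging: a global tag set carries run_id and phase (assumed here to come from environment variables), and per-request or per-sample tags add the endpoint. The service URL and endpoint below are placeholders:

import http from 'k6/http';
import { Counter } from 'k6/metrics';

// Global tags attached to every metric sample in this run
export const options = {
    tags: { run_id: __ENV.RUN_ID || 'local', phase: __ENV.PHASE || 'replay' },
};

const passed = new Counter('replay_passed');

export default function () {
    // Per-request tag so dashboards and reports can slice by endpoint
    const res = http.get('http://service:8080/orders', { tags: { endpoint: '/orders' } });
    passed.add(res.status === 200 ? 1 : 0, { endpoint: '/orders' });
}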

Mapping back to the design in this post:

  • Traffic Store supplies fixtures for mocks and expected responses.
  • Replay Orchestrator runs phases, enforces readiness/quality gates, and controls timing.
  • Response Validator computes correctness metrics (status/body diffs) and exposes pass rates.
  • Results Reporter joins correctness, performance, and mock metrics to baseline and emits the final verdict.

If any gate fails, keep artifacts, roll back to Phase 1 or 2 (depending on failure), and iterate.

Phase 4: Traffic Replay (inbound)

Only after the backends and the SUT are stable, and the Phase 3 gates are defined, should you start replay.

  • Workload: sessions ordered, concurrency sized, timing caps defined.
  • Cutoffs: abort on excessive 5xx, exploding latency, or mock miss spikes.
  • Artifacts: collect responses, diffs, metrics, and coverage.

orchestration:
  phase: replay
  traffic:
    sources: ["./prod-traffic/", "./local-traffic/"]
    concurrency: 10
    time_scale: 10x

State Management

The hardest problem in replay is ensuring the right data exists in your test database. Captured traffic assumes production state, so your test environment has to provide just enough of that state for each session to succeed.

Session Seeding and Cleanup

Seed the state each session needs before replay, then remove it afterward so runs stay repeatable. Wrapping the seed in a single uncommitted transaction doesn’t work here, because the service under test connects to the database on its own and would never see the data. Here’s a pattern using k6’s setup/teardown hooks with the xk6-sql extension:

import http from 'k6/http';
import { check } from 'k6';
import sql from 'k6/x/sql';

const db = sql.open('postgres', 'postgres://user:pass@localhost/testdb');

// open() only works in the init context, so load session data here
const sessionData = JSON.parse(open('./session-data.json'));

export function setup() {
    // Seed the minimal state each session needs (committed, so the service
    // under test can see it through its own database connection)
    seedDatabase(db, sessionData);

    return { sessionData };
}

export default function(data) {
    // Replay requests with seeded state
    data.sessionData.requests.forEach(req => {
        const response = http.post(req.url, req.body);
        check(response, { 'valid': (r) => r.status === req.expectedStatus });
    });
}

export function teardown(data) {
    // Remove the seeded rows so the next run starts from a clean slate
    data.sessionData.orderIds.forEach(orderId => {
        db.exec(`DELETE FROM orders WHERE id = '${orderId}'`);
    });
    data.sessionData.userIds.forEach(userId => {
        db.exec(`DELETE FROM users WHERE id = '${userId}'`);
    });
    db.close();
}

function seedDatabase(db, sessionData) {
    // Extract and create minimal entities
    sessionData.userIds.forEach(userId => {
        db.exec(`INSERT INTO users (id, email) VALUES ('${userId}', 'test-${userId}@example.com')`);
    });

    sessionData.orderIds.forEach(orderId => {
        db.exec(`INSERT INTO orders (id, user_id, status) VALUES ('${orderId}', '${sessionData.userIds[0]}', 'pending')`);
    });
}

ID Mapping

ID mapping happens during transformation (Part 2), so transformed traffic already carries test-environment IDs that line up with the data you seed. At replay time, you just use them as-is.

Mixing Production and Local Traffic

Combine production traffic with locally captured traffic to test both existing patterns and new code paths; together the two capture sources cover what real users do today and what your branch adds:

# Capture production traffic (Part 1)
./capture-prod-traffic --output ./prod-traffic/

# Run your service locally with proxymock
proxymock record --app-port 8080 --out ./local-traffic/

# Exercise new endpoints manually
curl http://localhost:4143/new-endpoint

# Replay combined traffic
./replay --traffic ./prod-traffic/,./local-traffic/ --target http://localhost:8080

This gives you:

  • Production coverage: Real user patterns, edge cases, load characteristics
  • New feature coverage: Locally captured traffic for endpoints that don’t exist in production yet

Scaling Replay

Parallel Execution

k6 handles parallelization automatically with Virtual Users (VUs). Each VU runs your script concurrently:

# Run 10 concurrent replay sessions
k6 run --vus 10 --duration 30s replay-script.js

Replay Speed Control

Replay faster than real-time to speed up tests:

// Replay at 10x speed
const delay = (req.timestamp - baseline) / 1000;
sleep(delay / 10.0);

Result Aggregation

k6 automatically aggregates results. Add custom metrics for tracking:

import http from 'k6/http';
import { check } from 'k6';
import { Counter } from 'k6/metrics';

const capturedRequests = JSON.parse(open('./captured-traffic.json'));

const passed = new Counter('replay_passed');
const failed = new Counter('replay_failed');

export default function() {
    capturedRequests.forEach(req => {
        const response = http.post(req.url, req.body);
        const valid = check(response, {
            'status matches': (r) => r.status === req.expectedStatus
        });

        if (valid) {
            passed.add(1);
        } else {
            failed.add(1);
            console.error(`FAIL: ${req.url} - expected ${req.expectedStatus}, got ${response.status}`);
        }
    });
}

After the run, k6 outputs summary statistics:

replay_passed...........: 1247  41.5/s
replay_failed...........: 3     0.1/s
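
To hand these numbers to the Results Reporter, k6’s handleSummary hook can write a machine-readable artifact at the end of the run. A minimal sketch, assuming the replay_passed and replay_failed counters above:

export function handleSummary(data) {
    // Pull the custom counters and latency percentile out of the end-of-test summary
    const report = {
        passed: data.metrics.replay_passed ? data.metrics.replay_passed.values.count : 0,
        failed: data.metrics.replay_failed ? data.metrics.replay_failed.values.count : 0,
        latency_p95_ms: data.metrics.http_req_duration.values['p(95)'],
    };
    return {
        'replay-report.json': JSON.stringify(report, null, 2), // artifact for the reporter
        stdout: JSON.stringify(report) + '\n',                 // short console echo
    };
}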

When to Use What

| Approach          | Use When                                   | Avoid When                                 |
|-------------------|--------------------------------------------|--------------------------------------------|
| Direct Replay     | Stateless services, read-only endpoints    | Complex state, many dependencies           |
| Shadow Traffic    | Performance testing, production validation | Database writes, need response validation  |
| Replay with Mocks | Integration testing, CI/CD, most scenarios | Early development (no traffic yet)         |

Recommended flow:

  1. Capture production traffic continuously (Part 1)
  2. Transform for reusability (Part 2)
  3. Replay with mocking in CI/CD
  4. Shadow traffic for final production validation

Tools

| Tool       | Approach                         | Best For                            |
|------------|----------------------------------|-------------------------------------|
| k6         | Load testing with JavaScript     | Performance testing, replay scripts |
| WireMock   | HTTP mocking                     | Dependency mocking, API stubs       |
| Speedscale | K8s-native replay + auto-mocking | Microservices, load testing         |
| GoReplay   | HTTP replay/shadowing            | Simple HTTP replay                  |
| Diffy      | Comparative replay               | Regression detection                |
| proxymock  | Local traffic capture + replay   | Development, CI/CD                  |

Conclusion

Traffic replay is a big data problem: you have tons of captured traffic, storage is cheap, so use it. Store everything (Part 1), identify what changes (Part 2), then replay against new code with selective mocking.

The key is mixing production traffic with local traffic. Production gives you real patterns and edge cases. Local traffic from proxymock covers new endpoints and features. Together they validate both existing behavior and new code paths.

Start with replay plus mocking: it’s the Goldilocks zone. Shadow traffic is fast but risky. Direct replay is simple but limited. Build out the core components from this series (store, orchestrator, transformer, mock manager, validator, reporter) and iterate.

Next steps:

If you want a production-ready solution, check out Speedscale—it handles capture, transform, and replay with automatic mocking for Kubernetes.

Get started for free

ProxyMock desktop or Speedscale Cloud — choose your path.