Why Software Is So Hard to Test (and Why AI Makes It Worse)
Target Audience
Software engineering leaders, platform leaders, and AI/ML leaders responsible for release velocity, quality, and risk management.
Executive Summary
Modern software teams know the uncomfortable truth: production data is the best test data. Unfortunately, production data is also full of PII, secrets, and sensitive context that cannot legally or ethically be exposed to developers or test environments.
This tension creates a persistent gap in quality and deployment speed. Teams test with incomplete, synthetic, or outdated data and are then surprised when production behaves differently. AI agents amplify the problem: they depend on realistic data distributions, sequences, and edge cases that are nearly impossible to recreate safely.
This post explains why PII is one of the core blockers to good testing, why it's more hidden than most teams realize, and why traditional approaches fall short.
Table of Contents
- Why realistic test data matters
- The hidden nature of PII in modern systems
- Every technology leaks PII differently
- Why developers can’t observe real production behavior
- Traditional test data workarounds (and why they fail)
- What AI coding agents expose: The real problem we need to solve
- Coming next: why traditional Test Data Management falls apart
1. Why realistic test data matters
High-quality testing depends on production-grade behavior:
- Real request and response shapes
- Real payload sizes and value distributions
- Real timing, ordering, and error conditions
- Real edge cases that no one thought to simulate
Synthetic data and hand-crafted fixtures tend to validate schemas, not systems. They confirm that software works in theory, not that it survives contact with reality.
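A minimal sketch of that gap (the helper and values below are hypothetical): a hand-written fixture passes cleanly, while inputs shaped like real traffic misbehave in ways the schema never catches.

```python
def normalize_phone(raw: str) -> str:
    """Hypothetical helper: normalize a US phone number to E.164-ish form."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return f"+1{digits[-10:]}"

# The fixture passes: the schema says "string", and the value is tidy.
assert normalize_phone("(555) 013-4567") == "+15550134567"

# Values shaped like real traffic are messier, and none of them raise an error:
print(normalize_phone("+44 20 7946 0958"))  # UK number silently rewritten as +1...
print(normalize_phone("555-0134 ext. 22"))  # extension digits corrupt the result
print(normalize_phone(""))                  # "+1" passed downstream as valid
```

Every one of those calls would pass a schema check; none of them would survive contact with a real user.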
For AI coding agents, this problem is magnified because AI systems are inherently stochastic: they can produce different outputs even from identical inputs. Introduce fake or synthetic test data into an already non-deterministic system and you compound the uncertainty. The agent must now navigate two layers of unpredictability: its own stochastic behavior and artificial data patterns that don't match real-world distributions. Real production data provides the grounding that stochastic AI systems need to produce consistent, reliable results.
Relying on incomplete or artificial data sets leads to missed critical test cases and lower effective test coverage; it also pushes teams toward the risky shortcut of copying raw production data, which is exactly how sensitive data gets exposed. High-quality, relevant data ensures that applications are tested against realistic scenarios, reducing bugs and defects.
2. The hidden nature of PII in modern systems
PII is not confined to obvious database columns like email or phone_number.
In modern distributed systems, PII is often hidden (see the sketch after this list), including:
- Base64-encoded fields inside API payloads that look opaque until decoded
- JWTs that contain confidential claims, identifiers, roles, or business context
- Nested JSON objects, headers, and metadata fields
- Binary formats like gRPC and Protobuf that require decoding just to inspect
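To make that concrete, here is a minimal sketch (field names and values are invented for illustration) showing how a single decode step separates an "opaque" payload from raw PII:

```python
import base64
import json

def b64url(obj: dict) -> str:
    """Base64url-encode a JSON object the way JWT segments are encoded."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# A hypothetical event payload. At a glance, nothing here looks like PII.
payload = {
    "event": "order.created",
    # An "opaque" context blob: really just base64-encoded JSON.
    "ctx": base64.b64encode(b'{"email":"jane.doe@example.com"}').decode(),
    # An unsigned demo token: JWT claims are encoded, not encrypted.
    "auth": ".".join([b64url({"alg": "none"}),
                      b64url({"sub": "jane.doe@example.com", "role": "admin"}),
                      ""]),
}

# One decode reveals the email hiding in the context blob.
print(json.loads(base64.b64decode(payload["ctx"])))

# The JWT claims segment is just as transparent once padding is restored.
claims = payload["auth"].split(".")[1]
claims += "=" * (-len(claims) % 4)
print(json.loads(base64.urlsafe_b64decode(claims)))
```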
This means teams often don't know where PII exists until it's already leaked.
3. Every technology leaks PII differently
Each technology layer requires technology-aware inspection and transformation: the decoding that surfaces PII in a JSON payload does nothing for a Protobuf message. Treating all systems the same leads to blind spots and incomplete testing. Spot-check your own staging environment and you will likely find examples.
“Just mask the data” assumes you can see the data first, which is increasingly untrue.
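A quick sketch of why (the pattern and values are hypothetical): a regex-based masker catches a plaintext email but waves the base64-encoded copy of the same email straight through.

```python
import base64
import re

EMAIL = re.compile(r"[\w.+-]+@[\w.-]+")

def mask(line: str) -> str:
    """Naive masking pass: redact anything that looks like an email."""
    return EMAIL.sub("***@***", line)

plain   = "user=jane.doe@example.com action=login"
encoded = "ctx=" + base64.b64encode(b"user=jane.doe@example.com").decode()

print(mask(plain))    # the plaintext email is redacted
print(mask(encoded))  # the encoded copy survives masking untouched
```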
4. Why developers can’t observe real production behavior
Because PII is everywhere — and often invisible — organizations restrict access to production data entirely.
The result:
- Logs are redacted or truncated
- Payloads are dropped
- Observability tools show metrics and traces without context
Developers can see that something failed, but not why. This lack of observability doesn’t just slow debugging — it permanently lowers software quality by preventing teams from learning from real production behavior. You can learn more about safe and deep visibility in our observability video series.
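As a sketch of how this happens in practice (the filter and field names are hypothetical, though the pattern mirrors common compliance setups), a single logging filter is enough to erase the one detail a developer needs:

```python
import logging

class RedactPayload(logging.Filter):
    """Hypothetical compliance filter: request bodies never reach log storage."""
    def filter(self, record: logging.LogRecord) -> bool:
        if hasattr(record, "payload"):
            record.payload = "[REDACTED]"
        return True

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s %(message)s payload=%(payload)s"))
logger.addHandler(handler)
logger.addFilter(RedactPayload())

# The failure is visible; the input that caused it is gone.
logger.error("schema validation failed",
             extra={"payload": {"amount": -3, "currency": "??"}})
# -> ERROR schema validation failed payload=[REDACTED]
```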
5. Traditional test data workarounds (and why they fail)
Common approaches include:
- Test Data Management (TDM) — Enterprise platforms like Delphix, Broadcom Test Data Manager, Informatica TDM, and IBM InfoSphere Optim that provide data subsetting, masking, and provisioning workflows
- Static data masking — Tools like IRI FieldShield, K2view Data Masking, and DataSunrise that permanently transform sensitive data at rest
- Synthetic data generation — Platforms such as Tonic.ai, GenRocket, Gretel.ai, and YData that create artificial datasets mimicking production patterns
- Periodic database snapshots — Database cloning and snapshot technologies that capture point-in-time database states for testing
These approaches assume:
- PII locations are known and static
- Data is batch-oriented
- Systems change slowly and data has predictable locations
Modern systems violate all three assumptions. As architectures become more distributed and event-driven, these methods struggle to keep up, especially when traffic shape and sequence matter more than individual records. TDM tools do help minimize storage costs by reducing redundant data copies, but cheaper copies of unrepresentative data don't close the quality gap.
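A minimal sketch of the first assumption failing (rules and column names invented for illustration): column-level masking only protects PII that stays where the rules expect it.

```python
# Hypothetical column-level masking rules, the shape classic TDM configs take.
MASK_RULES = {"users.email", "users.ssn"}

row = {
    "users.email":  "jane.doe@example.com",            # declared, so masked
    "orders.notes": "refund to jane.doe@example.com",  # PII drifted into free text
}

for column, value in row.items():
    print(column, "->", "***" if column in MASK_RULES else value)
# users.email -> ***
# orders.notes -> refund to jane.doe@example.com   (missed entirely)
```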
6. What AI coding agents expose: The real problem we need to solve
AI coding agents are exposing a fundamental problem that has been hiding in plain sight. These agents stumble in the same places human engineers do, wherever correct behavior depends on context that sanitized environments rarely preserve (one example is sketched after this list):
- Rare edge cases
- Real user behavior patterns
- Long-tail distributions
- Sequential decision-making context
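A toy comparison of the distribution gap (distributions and numbers chosen only for illustration): heavy-tailed "real" traffic produces extreme values that a bounded synthetic generator never will, and the tail is exactly where untested behavior lives.

```python
import random

random.seed(0)

# Real-world workloads are often heavy-tailed; naive generators sample a tidy range.
real      = [random.paretovariate(1.2) for _ in range(100_000)]  # long tail
synthetic = [random.uniform(1, 100)    for _ in range(100_000)]  # bounded

print(f"real max      ~ {max(real):,.0f}")       # thousands: the long tail
print(f"synthetic max ~ {max(synthetic):,.0f}")  # never exceeds 100
```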
Sanitized or synthetic data often removes the very signals AI coding agents rely on, leading to:
- Overconfidence in test results
- Surprising failures in production
- Slower iteration due to fear-driven release processes
But here’s what makes AI coding agents different: their stochastic nature amplifies the consequences. When you combine non-deterministic AI behavior with synthetic or sanitized test data, you’re compounding uncertainty in ways that make failures more frequent and harder to predict. The AI’s inability to produce reliable code when working with unrealistic data surfaces a truth that human engineers have learned to work around: we’ve been testing with inadequate data all along.
This is not just a compliance problem, and it’s not just an AI problem. The real challenge that AI coding agents are forcing us to confront is:
How do we safely observe and reuse real production behavior to improve software quality?
Until teams can answer that question, testing will remain slower, riskier, and less representative than production demands—whether the code is written by humans or AI.
7. Coming Next: Why Traditional Test Data Management Falls Apart
In Part 2, we’ll examine why classic Test Data Management was built for a different era. We no longer live in a world dominated by batch processing and monolithic databases. Modern systems require streaming, real-time approaches instead.