Why Software Is So Hard to Test (and Why AI Makes It Worse)
Target Audience
Software engineering leaders, platform leaders, and AI/ML leaders responsible for release velocity, quality, and risk management.
Executive Summary
Modern software teams know the uncomfortable truth: production data is the best test data. Unfortunately, production data is also full of PII, secrets, and sensitive context that cannot legally or ethically be exposed to developers or test environments.
This tension creates a persistent gap in quality and deployment speed. Teams test with incomplete, synthetic, or outdated data and are then surprised when production behaves differently. AI agents amplify the problem: they depend on realistic data distributions, sequences, and edge cases that are nearly impossible to recreate safely.
This post explains why PII is one of the core blockers to good testing, why it's more hidden than most teams realize, and why traditional approaches fall short.
Table of Contents
- Why realistic test data matters
- The hidden nature of PII in modern systems
- Every technology leaks PII differently
- Why developers can’t observe real production behavior
- Traditional test data workarounds (and why they fail)
- What AI coding agents expose: The real problem we need to solve
- Coming next: why traditional Test Data Management falls apart
1. Why realistic test data matters
High-quality testing depends on production-grade behavior:
- Real request and response shapes
- Real payload sizes and value distributions
- Real timing, ordering, and error conditions
- Real edge cases that no one thought to simulate
Synthetic data and hand-crafted fixtures tend to validate schemas, not systems. They confirm that software works in theory, not that it survives contact with reality.
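A minimal sketch of that gap (the helper and values below are hypothetical): a hand-written fixture passes cleanly, while inputs shaped like real traffic misbehave in ways the schema never catches.

```python
def normalize_phone(raw: str) -> str:
    """Hypothetical helper: normalize a US phone number to E.164-ish form."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return f"+1{digits[-10:]}"

# The fixture passes: the schema says "string", and the value is tidy.
assert normalize_phone("(555) 013-4567") == "+15550134567"

# Values shaped like real traffic are messier, and none of them raise an error:
print(normalize_phone("+44 20 7946 0958"))  # UK number silently rewritten as +1...
print(normalize_phone("555-0134 ext. 22"))  # extension digits corrupt the result
print(normalize_phone(""))                  # "+1" passed downstream as valid
```

Every one of those calls would pass a schema check; none of them would survive contact with a real user.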
For AI coding agents, this problem is magnified because AI systems are inherently stochastic: they can produce different outputs even from identical inputs. Introduce fake or synthetic test data into an already non-deterministic system and you compound the uncertainty. The agent must now navigate two layers of unpredictability: its own stochastic behavior and artificial data patterns that don't match real-world distributions. Real production data provides the grounding that stochastic AI systems need to produce consistent, reliable results.
Relying on incomplete or artificial data sets leads to missed critical test cases and lower effective test coverage; it also pushes teams toward the risky shortcut of copying raw production data, which is exactly how sensitive data gets exposed. High-quality, relevant data ensures that applications are tested against realistic scenarios, reducing bugs and defects.
2. The hidden nature of PII in modern systems
PII is not confined to obvious database columns like email or phone_number.
In modern distributed systems, PII is often hidden (see the sketch after this list), including:
- Base64-encoded fields inside API payloads that look opaque until decoded
- JWTs that contain confidential claims, identifiers, roles, or business context
- Nested JSON objects, headers, and metadata fields
- Binary formats like gRPC and Protobuf that require decoding just to inspect
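To make that concrete, here is a minimal sketch (field names and values are invented for illustration) showing how a single decode step separates an "opaque" payload from raw PII:

```python
import base64
import json

def b64url(obj: dict) -> str:
    """Base64url-encode a JSON object the way JWT segments are encoded."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# A hypothetical event payload. At a glance, nothing here looks like PII.
payload = {
    "event": "order.created",
    # An "opaque" context blob: really just base64-encoded JSON.
    "ctx": base64.b64encode(b'{"email":"jane.doe@example.com"}').decode(),
    # An unsigned demo token: JWT claims are encoded, not encrypted.
    "auth": ".".join([b64url({"alg": "none"}),
                      b64url({"sub": "jane.doe@example.com", "role": "admin"}),
                      ""]),
}

# One decode reveals the email hiding in the context blob.
print(json.loads(base64.b64decode(payload["ctx"])))

# The JWT claims segment is just as transparent once padding is restored.
claims = payload["auth"].split(".")[1]
claims += "=" * (-len(claims) % 4)
print(json.loads(base64.urlsafe_b64decode(claims)))
```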
This means teams often don't know where PII exists until it's already leaked.
3. Every technology leaks PII differently
Each technology layer requires technology-aware inspection and transformation: the decoding that surfaces PII in a JSON payload does nothing for a Protobuf message. Treating all systems the same leads to blind spots and incomplete testing. Spot-check your own staging environment and you will likely find examples.
“Just mask the data” assumes you can see the data first, which is increasingly untrue.
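A quick sketch of why (the pattern and values are hypothetical): a regex-based masker catches a plaintext email but waves the base64-encoded copy of the same email straight through.

```python
import base64
import re

EMAIL = re.compile(r"[\w.+-]+@[\w.-]+")

def mask(line: str) -> str:
    """Naive masking pass: redact anything that looks like an email."""
    return EMAIL.sub("***@***", line)

plain   = "user=jane.doe@example.com action=login"
encoded = "ctx=" + base64.b64encode(b"user=jane.doe@example.com").decode()

print(mask(plain))    # the plaintext email is redacted
print(mask(encoded))  # the encoded copy survives masking untouched
```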
4. Why developers can’t observe real production behavior
Because PII is everywhere — and often invisible — organizations restrict access to production data entirely.
The result:
- Logs are redacted or truncated
- Payloads are dropped
- Observability tools show metrics and traces without context
Developers can see that something failed, but not why. This lack of observability doesn’t just slow debugging — it permanently lowers software quality by preventing teams from learning from real production behavior. You can learn more about safe and deep visibility in our observability video series.
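As a sketch of how this happens in practice (the filter and field names are hypothetical, though the pattern mirrors common compliance setups), a single logging filter is enough to erase the one detail a developer needs:

```python
import logging

class RedactPayload(logging.Filter):
    """Hypothetical compliance filter: request bodies never reach log storage."""
    def filter(self, record: logging.LogRecord) -> bool:
        if hasattr(record, "payload"):
            record.payload = "[REDACTED]"
        return True

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s %(message)s payload=%(payload)s"))
logger.addHandler(handler)
logger.addFilter(RedactPayload())

# The failure is visible; the input that caused it is gone.
logger.error("schema validation failed",
             extra={"payload": {"amount": -3, "currency": "??"}})
# -> ERROR schema validation failed payload=[REDACTED]
```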
5. Traditional test data workarounds (and why they fail)
Common approaches include:
- Test Data Management (TDM) — Enterprise platforms like Delphix, Broadcom Test Data Manager, Informatica TDM, and IBM InfoSphere Optim that provide data subsetting, masking, and provisioning workflows
- Static data masking — Tools like IRI FieldShield, K2view Data Masking, and DataSunrise that permanently transform sensitive data at rest
- Synthetic data generation — Platforms such as Tonic.ai, GenRocket, Gretel.ai, and YData that create artificial datasets mimicking production patterns
- Periodic database snapshots — Database cloning and snapshot technologies that capture point-in-time database states for testing
These approaches assume:
- PII locations are known and static
- Data is batch-oriented
- Systems change slowly and data has predictable locations
Modern systems violate all three assumptions. As architectures become more distributed and event-driven, these methods struggle to keep up, especially when traffic shape and sequence matter more than individual records. TDM tools do help minimize storage costs by reducing redundant data copies, but cheaper copies of unrepresentative data don't close the quality gap.
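A minimal sketch of the first assumption failing (rules and column names invented for illustration): column-level masking only protects PII that stays where the rules expect it.

```python
# Hypothetical column-level masking rules, the shape classic TDM configs take.
MASK_RULES = {"users.email", "users.ssn"}

row = {
    "users.email":  "jane.doe@example.com",            # declared, so masked
    "orders.notes": "refund to jane.doe@example.com",  # PII drifted into free text
}

for column, value in row.items():
    print(column, "->", "***" if column in MASK_RULES else value)
# users.email -> ***
# orders.notes -> refund to jane.doe@example.com   (missed entirely)
```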
6. What AI coding agents expose: The real problem we need to solve
AI coding agents are exposing a fundamental problem that has been hiding in plain sight. These agents stumble in the same places human engineers do, wherever correct behavior depends on context that sanitized environments rarely preserve (one example is sketched after this list):
- Rare edge cases
- Real user behavior patterns
- Long-tail distributions
- Sequential decision-making context
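A toy comparison of the distribution gap (distributions and numbers chosen only for illustration): heavy-tailed "real" traffic produces extreme values that a bounded synthetic generator never will, and the tail is exactly where untested behavior lives.

```python
import random

random.seed(0)

# Real-world workloads are often heavy-tailed; naive generators sample a tidy range.
real      = [random.paretovariate(1.2) for _ in range(100_000)]  # long tail
synthetic = [random.uniform(1, 100)    for _ in range(100_000)]  # bounded

print(f"real max      ~ {max(real):,.0f}")       # thousands: the long tail
print(f"synthetic max ~ {max(synthetic):,.0f}")  # never exceeds 100
```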
Sanitized or synthetic data often removes the very signals AI coding agents rely on, leading to:
- Overconfidence in test results
- Surprising failures in production
- Slower iteration due to fear-driven release processes
But here’s what makes AI coding agents different: their stochastic nature amplifies the consequences. When you combine non-deterministic AI behavior with synthetic or sanitized test data, you’re compounding uncertainty in ways that make failures more frequent and harder to predict. The AI’s inability to produce reliable code when working with unrealistic data surfaces a truth that human engineers have learned to work around: we’ve been testing with inadequate data all along.
This is not just a compliance problem, and it’s not just an AI problem. The real challenge that AI coding agents are forcing us to confront is:
How do we safely observe and reuse real production behavior to improve software quality?
Until teams can answer that question, testing will remain slower, riskier, and less representative than production demands—whether the code is written by humans or AI.
7. Coming Next: Why Traditional Test Data Management Falls Apart
In Part 2, we’ll examine why classic Test Data Management was built for a different era. We no longer live in a world dominated by batch processing and monolithic databases. Modern systems require streaming, real-time approaches instead.