Dale Frohman: Should we test in production?

It’s always exciting when your work is recognized! We appreciate that Dale Frohman chose to include us in his most recent blog on LinkedIn - Should we test in production?

The below is a repost of his blog.

The challenge with any testing is that we rarely have an exact duplicate of production in a lower environment to test on. This is often because we are not using the same data, tests, volume and scenarios that production would see.

Some of the challenges are:

The right number of concurrent connections
A specific network stack
Race conditions
Loosely coupled services
Network flakiness
Ephemeral runtimes
Specific hardware and configurations
Build environment
Deployment code and process
Cache hits or misses
Containers / VMs and their bugs
Specific scheduled jobs
Clients with their own specific retries, timeouts and load shedding
The Internet
Queues
Humans
Environment settings
Etc…

In my experience, lower environments are never an exact replica of production due to the following challenges:

Cluster sizes are different
Since clusters are smaller, configuration options are different
The number of connection handles is much lower
Monitoring is not as extensive as production
Inability to test specific datasets, as doing so would affect downstream systems and database counts
And much more

We aren’t just testing code anymore. The systems we test are complex. They have unpredictable interactions, sometimes out-of-order events or messages, and other properties that make them hard to test outside of production.

Think about it: every time we deploy to production we are testing a unique, never before seen or replicated combination of artifact, environment, infrastructure, time of day, etc.

Our applications are being tested every day in production by our customers. We just need to find a way to use all of the data customers are already generating.

More production data makes it easier to design load tests that accurately reflect actual server load. Testing is about reducing uncertainty. It is all about risk management, and there are many categories of uncertainty that can only ever be truly tested in prod, such as behavioral testing, A/B testing and realistic load testing.
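As a rough illustration of using production data to shape a load test, here is a minimal Python sketch. It assumes you can export a sample of production requests as (method, path) pairs; the sample data and the function names load_profile and plan_load_test are made up for this example and do not come from any particular tool.

```python
from collections import Counter

def load_profile(requests):
    """Return each endpoint's share of total traffic from (method, path) pairs."""
    counts = Counter(requests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {endpoint: n / total for endpoint, n in counts.items()}

def plan_load_test(profile, total_requests):
    """Scale the observed traffic mix to a target number of test requests."""
    return {endpoint: round(share * total_requests)
            for endpoint, share in profile.items()}

# Hypothetical sample of production traffic (assumption, not real data).
sample = ([("GET", "/items")] * 6
          + [("POST", "/orders")] * 3
          + [("GET", "/health")] * 1)

profile = load_profile(sample)
plan = plan_load_test(profile, 1000)
```

The point is that the test mix is derived from what customers actually do rather than from a guess. A real profile would likely also need request rates over time, payload sizes and concurrency.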

I am a big believer in blue/green and canary deployments. With the proper monitoring you can limit the risk and quickly switch back if needed. Here are some other tools and techniques that will allow you to test more safely in production:

Feature flags
Observability (tracing)
Soak testing (testing over a long period of time under a realistic level of concurrency and load)
Profiling
Teeing
Chaos Engineering
Speedscale
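To make the canary and feature flag ideas a little more concrete, here is a minimal Python sketch of a percentage-based rollout that switches back when monitoring shows the canary is unhealthy. The function names, the 1% tolerance and the 10-point step are assumptions chosen for illustration, not the behavior of any specific feature flag product.

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group via a hash bucket (0-99)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

def next_percent(current: int, canary_error_rate: float,
                 baseline_error_rate: float,
                 tolerance: float = 0.01, step: int = 10) -> int:
    """Widen the rollout while the canary is healthy; drop to 0 (roll back) otherwise."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0
    return min(100, current + step)
```

Hashing the user ID keeps each user on the same side of the flag between requests, and returning 0 from next_percent is the "quickly switch back" step. The error rates would come from whatever monitoring you already have in production.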

I am impressed with what Speedscale has built. It is a traffic replay framework for API testing in Kubernetes that lets you replay past traffic and simulate responses from third-party APIs, based on real traffic, in seconds. I think this combined with progressive deployments will be the future of testing.

This notion of testing in production applies not just to applications but to network changes as well. With closed-loop test automation we can gain more confidence and trust. This methodology has three stages:

Pre-approval testing
- Peer review
Deployment pre-testing
- Before deploying, make sure the network is in a desired healthy state.
Deployment post-testing
- Test that the change produced the intended behavior
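The second and third stages above can be sketched as a small Python wrapper around any change. Peer review is a human step and is not modeled here. The function apply_change and the example state dictionary are hypothetical names used only for this illustration, and the automatic rollback on a failed post-test is an added assumption in line with the fail-over idea in this post.

```python
from typing import Callable

def apply_change(pre_check: Callable[[], bool],
                 apply: Callable[[], None],
                 post_check: Callable[[], bool],
                 rollback: Callable[[], None]) -> str:
    """Run a change inside a pre-test / post-test loop and report the outcome."""
    # Deployment pre-testing: refuse to start from an unhealthy state.
    if not pre_check():
        return "aborted"
    apply()
    # Deployment post-testing: confirm the change did what was intended.
    if post_check():
        return "applied"
    # Assumption: a failed post-test triggers an immediate rollback.
    rollback()
    return "rolled_back"

# Hypothetical example: raising the MTU on a made-up device state.
state = {"mtu": 1500, "healthy": True}
result = apply_change(
    pre_check=lambda: state["healthy"],
    apply=lambda: state.update(mtu=9000),
    post_check=lambda: state["mtu"] == 9000,
    rollback=lambda: state.update(mtu=1500),
)
```

In practice the checks would query real devices or monitoring, but the shape of the loop stays the same: check, change, check again, and undo if the result is not what you expected.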

The tools below will help with testing these network changes

Give the ideas and tools above a try to make your changes robust and error free.

I am not saying that everyone can test in production today. It is scary typing it, and many may not want to whisper these words out loud, but over time that should be the goal of any savvy senior engineer.

In my experience, the lower environments are never replicas of production for one reason or another, and this almost always comes up as a question to be answered during our post-mortem:

Was this tested in lower environments?

To be able to successfully and safely test in production, automation must be solid and your fail-over to a previous state must be instantly available.

Over time you will gain the trust and support of your team and leadership, and one day you will test in production.
