The following is a repost of the original blog post.
The challenge with any testing is that we rarely have an exact duplicate of production in a lower environment to test on. Lower environments typically do not see the same data, volume, and scenarios that production does.
Some of the challenges are:
- The right number of concurrent connections
- A specific network stack
- Race conditions
- Loosely coupled services
- Network flakiness
- Ephemeral runtimes
- Specific hardware and configurations
- Build environment
- Deployment code and process
- Cache hits or misses
- Containers / VMs and their bugs
- Specific scheduled jobs
- Clients with their own specific retries, timeouts and load shedding
- The Internet
- Environment settings
In my experience, lower environments are never an exact replica of production due to the following challenges:
- Cluster sizes are different
- Since clusters are smaller, configuration options are different
- The number of connections handled is much lower
- Monitoring is not as extensive as production
- Inability to test against specific datasets, as doing so would skew downstream systems and database counts
- And much more
We aren’t just testing code anymore. The systems we test are complex: they have unpredictable interactions, sometimes out-of-order events and messages, and other properties that make them hard to test outside of production.
Think about it: every time we deploy to production we are testing a unique, never-before-seen combination of artifact, environment, infrastructure, time of day, and more.
Our applications are being tested every day in production by our customers, we just need to find a way to use all of the data customers are already generating.
With more production data, it becomes easier to design load tests that accurately reflect actual server load. Testing is about reducing uncertainty; it is all about risk management, and there are many categories of uncertainty that can only ever be truly tested in production, such as behavioral testing, A/B testing, and realistic load testing.
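As a small illustration of shaping a load test from production data, the sketch below weights virtual users by each endpoint's share of real traffic, so the test mirrors the actual load profile. The log format and the `load_profile` helper are assumptions for the example, not a specific tool's API.

```python
from collections import Counter

def load_profile(access_log, total_virtual_users):
    """Derive a per-endpoint load profile from production traffic.

    Each endpoint gets virtual users in proportion to its share of
    real production requests, so the test reflects actual load shape.
    """
    counts = Counter(entry["path"] for entry in access_log)
    total = sum(counts.values())
    return {
        path: max(1, round(total_virtual_users * n / total))
        for path, n in counts.items()
    }

# Hypothetical production log entries: 70% search, 20% checkout, 10% admin
log = ([{"path": "/search"}] * 70
       + [{"path": "/checkout"}] * 20
       + [{"path": "/admin"}] * 10)
print(load_profile(log, total_virtual_users=50))
# {'/search': 35, '/checkout': 10, '/admin': 5}
```

A load test shaped this way stresses the endpoints customers actually hit, rather than a uniform spread that production never sees.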
I am a big believer in blue/green and canary deployments. With proper monitoring you can limit the risk and quickly switch back if needed. Here are some other tools and techniques that will allow you to test more safely in production:
- Feature flags
- Observability (tracing)
- Soak testing (testing over a long period of time under a realistic level of concurrency and load)
- Chaos Engineering
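Feature flags in particular pair well with canary-style rollouts. Below is a minimal sketch of a deterministic percentage rollout; the `flag_enabled` helper, flag name, and bucketing scheme are illustrative assumptions, not any particular flag library's API.

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_percent):
    """Deterministic percentage rollout: the same user always gets the
    same answer for a given flag, so users do not flap between code
    paths, and a bad rollout can be dialed back to 0 instantly."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0-99
    return bucket < rollout_percent

# Route a request through the new code path only for the rollout cohort
if flag_enabled("new-checkout", user_id="user-42", rollout_percent=10):
    pass  # new code path (under test in production)
else:
    pass  # stable code path
```

Hashing the flag name together with the user ID keeps cohorts independent across flags, so enabling one experiment does not correlate with another.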
I am impressed with what Speedscale has built. They allow you to quickly replay past traffic and simulate responses from third party APIs based on real traffic in seconds. It is a traffic replay framework for API testing in Kubernetes. I think this combined with progressive deployments will be the future of testing.
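To make the record-and-replay idea concrete, here is a toy sketch in Python. The `TrafficRecorder` and `ReplayStub` classes are hypothetical illustrations of the concept, not Speedscale's actual API.

```python
import json

class TrafficRecorder:
    """Record request/response pairs observed in production so they
    can be replayed against a new build."""
    def __init__(self):
        self.recordings = []

    def record(self, request, response):
        self.recordings.append({"request": request, "response": response})

class ReplayStub:
    """Answer outbound third-party calls from recorded traffic instead
    of hitting the real API during a replay run."""
    def __init__(self, recordings):
        self._by_request = {
            json.dumps(r["request"], sort_keys=True): r["response"]
            for r in recordings
        }

    def respond(self, request):
        key = json.dumps(request, sort_keys=True)
        if key not in self._by_request:
            raise KeyError(f"no recorded response for {request}")
        return self._by_request[key]

# Record a real third-party exchange, then serve it during replay
recorder = TrafficRecorder()
recorder.record({"method": "GET", "path": "/rates?base=USD"},
                {"status": 200, "body": {"EUR": 0.92}})

stub = ReplayStub(recorder.recordings)
print(stub.respond({"method": "GET", "path": "/rates?base=USD"}))
```

The point is that the stubbed responses come from real production traffic, so the replayed test exercises the system against behavior third parties actually exhibited.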
This notion of testing in production isn’t just for applications, but for network changes as well. With closed-loop test automation we can gain more confidence and trust. This methodology has three stages:
- Pre-approval testing
  - Peer review
- Deployment pre-testing
  - Before deploying, make sure the network is in the desired healthy state.
- Deployment post-testing
  - Test that the change produced the intended behavior.
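The pre- and post-deployment stages above can be sketched as a single health-gate function: run it before the change to confirm the network is in the desired state, and again after to confirm the change produced the intended behavior. The check names here are hypothetical; real checks would probe BGP sessions, interface counters, reachability, and so on.

```python
def network_healthy(checks):
    """Run a set of named health checks; return overall status plus
    the names of any checks that failed."""
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)

# Deployment pre-testing: is the network in the desired healthy state?
pre_checks = {
    "bgp_sessions_established": lambda: True,   # hypothetical probe
    "no_interface_errors": lambda: True,        # hypothetical probe
}
ok, failed = network_healthy(pre_checks)
print("safe to deploy" if ok else f"blocked by {failed}")

# Deployment post-testing: did the change produce the intended behavior?
post_checks = {
    "new_route_advertised": lambda: True,       # hypothetical probe
    "latency_within_slo": lambda: True,         # hypothetical probe
}
ok, failed = network_healthy(post_checks)
print("change verified" if ok else f"roll back: {failed}")
```

Closing the loop means the deployment pipeline acts on these results automatically, blocking or reverting the change, rather than leaving verification to a human reading dashboards.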
The tools below will help with testing these network changes. Give the ideas and tools above a try to make your changes robust and error-free.
I am not saying that everyone can test in production today. It is scary to type, and many may not want to whisper these words out loud, but over time it should be the goal of any savvy senior engineer.
In my experience, lower environments are never exact replicas of production for one reason or another, and the question almost always comes up during the post-mortem: was this tested in lower environments?
To test successfully and safely in production, your automation must be solid and your fail-over to a previous known-good state must be instantly available.
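That fail-over discipline can be sketched as a deploy step wrapped in automated health monitoring with an instant revert. The `deploy`, `health_check`, and `rollback` hooks below are hypothetical placeholders for your own tooling.

```python
import time

def deploy_with_rollback(deploy, health_check, rollback, checks=3, interval=0.0):
    """Deploy, then watch health checks; revert instantly on any failure.

    The previous known-good state must already be staged so that
    rollback() is a fast switch (e.g. flipping a load balancer back),
    not a rebuild."""
    deploy()
    for _ in range(checks):
        if not health_check():
            rollback()
            return "rolled back"
        time.sleep(interval)
    return "deployed"

# Hypothetical hooks; real ones would call your deploy and LB tooling.
result = deploy_with_rollback(
    deploy=lambda: print("shifting traffic to new version"),
    health_check=lambda: True,   # e.g. error rate below threshold
    rollback=lambda: print("reverting to previous version"),
)
print(result)  # deployed
```

The key property is that no human has to notice the failure: the same automation that shifted traffic forward shifts it back the moment a check fails.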
Over time you will gain the trust and support of your team and leadership, and one day you will test in production.