The main idea behind Kubernetes is to create a standardized approach to running containers in the cloud. Whether you’re running AKS on Azure or EKS on Amazon, your cluster should still behave in more or less the same way.
But that’s not to say you’re locked into doing things one way; Kubernetes still offers plenty of flexibility, and that flexibility is exactly what experienced engineers take advantage of when optimizing Kubernetes performance.
In part 1 of this series, you got an overview of what traffic replay is, and how it can help you discover optimizations within the configurations of your cluster.
Part 2 will continue the idea of using traffic replay, showcasing how comparisons can become much more powerful and deterministic.
Creating Powerful Comparisons
When it comes to optimizing your Kubernetes cluster, there’s a multitude of factors to consider, from storage classes to monitoring solutions to the underlying infrastructure itself. There are many ways to go about testing these elements, but being able to handle multiple different configurations—while still being deterministic—is absolutely crucial.
Below, you’ll see how traffic replay fulfills this requirement.
Comparing Storage Classes
Storage classes are one of those Kubernetes concepts you may never need to think about. In most scenarios, the default storage class is good enough.
However, at a certain scale or with a complex application, using the right storage class can have a significant impact on your performance.
Imagine you’re running a large e-commerce website, storing large amounts of data like product images and customer information. Being able to access product images quickly is essential, as this impacts load time and subsequently the user experience.
On the other hand, accessing customer information—like when someone’s at the checkout—needs to be quick, but not necessarily as quick as accessing product images.
Because of this, it makes sense to use two different storage classes: one that’s costly but blazingly fast, and one that’s cost-efficient but a bit slower.
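To make this concrete, here’s a minimal sketch of what those two storage classes might look like. The class names and parameters are hypothetical and use the AWS EBS CSI driver as an example; the equivalent settings will differ on other providers.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                 # hypothetical: costly but blazingly fast, e.g. for product images
provisioner: ebs.csi.aws.com     # AWS EBS CSI driver (assumption; swap for your provider)
parameters:
  type: io2                      # provisioned-IOPS SSD
  iops: "10000"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd             # hypothetical: cost-efficient but slower, e.g. for customer data
provisioner: ebs.csi.aws.com
parameters:
  type: st1                      # throughput-optimized HDD
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Workloads then opt into one class or the other by setting storageClassName on their PersistentVolumeClaims.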
To compare different storage classes, you need to run the application and generate load against it. This can be done either manually or with a traffic replay tool.
Doing it manually is time-consuming: you have to craft the requests, modify the manifest file to use a different storage class, execute the requests, collect the data, and finally analyze it. On top of that, hand-crafted requests won’t necessarily give you a realistic picture of performance.
A tool like Speedscale can handle this entire process automatically, provided you’ve captured traffic from your production services. This works because Speedscale lets you add annotations to your existing manifest files, as you can see here:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-service
  annotations:
    replay.speedscale.com/snapshot-id: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    replay.speedscale.com/testconfig-id: "standard"
    replay.speedscale.com/cleanup: "inventory"
    sidecar.speedscale.com/inject: "true"
...
In case you’re not familiar with Speedscale, the snapshot-id refers to a snapshot of captured traffic, which is what will be replayed against your service.
Because these annotations can be added to any Deployment, it’s easy to write a script that automatically swaps in a different storage class and redeploys the workload. Once the Speedscale Operator notices a Deployment with these annotations, it automatically starts replaying the snapshot against it. The cleanup: "inventory" annotation ensures that the test resources are removed afterward as well, meaning you won’t have to worry about wasting resources.
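Here’s a rough sketch of what such a script could look like. The manifest path, PVC name, and storage class names are hypothetical, and it assumes the manifest references the storage class through an environment variable:

#!/usr/bin/env bash
# Hypothetical sketch: replay the same snapshot against each storage class.
# Assumes manifests/cart-service.yaml uses ${STORAGE_CLASS} as a placeholder
# and already carries the Speedscale annotations shown above.
set -euo pipefail

for STORAGE_CLASS in fast-ssd standard-hdd; do
  echo "Deploying cart-service with storage class: ${STORAGE_CLASS}"

  # Tear down the previous run; PVCs are immutable, so the old claim must go too.
  kubectl delete deployment cart-service --ignore-not-found
  kubectl delete pvc cart-service-data --ignore-not-found

  # Render the manifest with the chosen storage class and apply it.
  STORAGE_CLASS="${STORAGE_CLASS}" envsubst < manifests/cart-service.yaml | kubectl apply -f -

  # Wait for the rollout; the Speedscale Operator then notices the
  # annotations and kicks off the replay on its own.
  kubectl rollout status deployment/cart-service --timeout=300s
done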
In practice, this allows you to create a repeatable, deterministic, and realistic test suite. And, Speedscale will even collect and analyze metrics for you when it’s done running.
Comparing Monitoring Solutions
Getting set up with a good monitoring solution is critical for most organizations, which means you’ll likely have to compare solutions at some point.
A lot can be determined from a solution’s marketing material and documentation; however, some things you have to test for yourself. Only by using a solution can you tell whether it introduces higher latency, consumes more resources, or is more complex to implement in your exact infrastructure.
Monitoring your infrastructure—especially cloud infrastructure—isn’t as simple as collecting metrics on CPU and RAM. Not only will you likely want to see correlation between these metrics, but also additional metrics like ingress, availability, health, and more.
For this reason, actually setting up and comparing monitoring solutions can be a big aid in making the right choice.
Though in most cases it’s possible to instrument an application with multiple monitoring agents, it’s not recommended for resource and performance reasons. Also, you risk the agents interfering with each other, ultimately misrepresenting true performance.
On the other hand, testing one solution at a time can result in a misrepresentation of true performance as well. While most organizations won’t experience major fluctuations in traffic from day to day—or week to week—any difference in the incoming traffic will lead to an uneven comparison.
So, at this point you might think that you can just spin up the different monitoring solutions in a test environment, and then generate traffic towards them. And you’d be right.
There is a catch, however. If you are configuring requests manually, you’ll be getting an even but unrealistic comparison. Creating the same kinds of fluctuations in traffic, request paths, number of requests, etc., is a time-consuming, if not impossible, task.
It’s crucial that the traffic generated in the test environment is realistic, fully replicating the behaviors you see in your production cluster, which is of course where traffic replay excels.
Whether you choose to spin up the different monitoring solutions one at a time or concurrently, traffic replay ensures that the incoming traffic is the exact same every time.
This ultimately leads to a much fairer and more useful comparison, with the added benefit of giving you a realistic view of how the solution behaves once it’s deployed in production.
Measuring the Impact of Underlying Infrastructure Changes
For most organizations, the choice between AKS and EKS won’t come down to the Kubernetes service itself; it will depend on where the rest of your infrastructure runs. If everything else is in AWS, it’s unlikely you’ll decide to run Kubernetes in Azure.
But you may reach a point where you want to consider other Kubernetes providers, whether for cost, features, performance, or something else entirely. Being able to create a realistic comparison is essential, as the decision will have major consequences for the future of your cluster.
Like with the other examples in this post, the major benefit of traffic replay here is how it helps you create a repeatable and realistic test suite. However, Speedscale specifically has an option that takes this to the next level.
If you’re at the point of looking into different Kubernetes providers, you’re most likely running an application of significant proportions. Because of this, it may be necessary to ensure that tests on different providers are running at the same time.
Because Speedscale works by installing an Operator in each of your clusters, it’s capable of interacting with them all through a single interface, meaning you can start a test in multiple different clusters without having to switch context.
You can do so via the WebUI, but for true concurrency you’ll likely want to incorporate it into a script. Earlier, you saw how this can be done via annotations, but it’s also possible to start tests with the speedctl CLI tool:
$ speedctl infra replay \
--test-config-id standard \
--snapshot-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--cluster <cluster-name> cart-service
As you can see, executing tests in another cluster is as simple as modifying the --cluster parameter. And again, this is done through the Speedscale Operator, meaning you won’t have to think about switching your kubectl context.
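For instance, a small script could fan the same replay out to several clusters at once. The cluster names below are placeholders, and the flags simply mirror the command shown above:

#!/usr/bin/env bash
# Hypothetical sketch: kick off the same replay in multiple clusters concurrently.
set -euo pipefail

SNAPSHOT_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

for CLUSTER in aks-cluster eks-cluster; do
  speedctl infra replay \
    --test-config-id standard \
    --snapshot-id "${SNAPSHOT_ID}" \
    --cluster "${CLUSTER}" cart-service &   # background each invocation
done

wait   # block until every speedctl call has returned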
Companies have already leveraged Speedscale to perform these kinds of comparisons; one company compared Google’s Tau VMs against another cloud provider, discovered 44% better performance, and ultimately netted seven-figure savings on its cloud bill.
What Comparisons Will You Make?
Ultimately, traffic replay isn’t the only way to test infrastructure changes, but it’s by far the most effective option for most use cases. Whether it’s the right solution for you is a decision you’ll have to make yourself.
Hopefully this post has at least given you a better idea of how you can use it in your own comparisons. Remember that these are just a few examples, and you’ll likely find many more use cases once you start using it yourself.
At this point, you’ve seen in part 1 how configurations in Kubernetes can be optimized, and this part has shown you how to create powerful comparisons. Look out for the next and final part, which shows you how to streamline your development process.