Microservices help increase engineering velocity, but most engineers are still in the early stages of understanding best practices for debugging failure modes once these systems enter production. At Speedscale, we regularly diagnose issues in highly network-dependent, yet poorly understood, microservices. In other words, we’re a lot like every other SRE responsible for keeping a complex application humming.
Today I want to talk about a debugging edge case. What if we need to capture traffic from the very beginning of the container lifecycle to isolate a problem? For example, one of our API gateway containers in a demo app pings MongoDB immediately upon startup. If MongoDB is not available, the queue deadlocks and certain outbound requests start blocking silently hours later. There is no indication that the initial ping failed or that it is causing the subsequent failures. In this situation, the network is a good place to look. Now, if you have perfect log messages, 10x engineers, and no turnover, you’ll never see this problem. Also, let me know where you work because it sounds wondrous.
If you are using Kubernetes and aren’t specifically diagnosing startup behavior, then I highly recommend github/eldadru/ksniff. It makes the process of capturing network traffic in a Kubernetes Pod silky smooth.
However, if you specifically want to grab startup traffic, you can use this quick and dirty technique for capturing and visualizing network startup activity with tools available for any Linux distro. No licenses, network SPANs, or SaaS services required. Let’s get to it…
NOTE: These tricks assume a Kubernetes container environment with a stripped down Linux image like Alpine. You can easily adapt this technique to other container orchestration environments.
Most production grade containers are stripped down to pass security audits and conserve resources. That’s the right way to manage cattle, but not what we need in this scenario. We need to create a special debug container that includes various network tools that help our analysis.
Add this snippet to your Dockerfile to install some helpful network tools. It assumes a Debian-based image; on Alpine, install the equivalent packages with apk add instead:
# Additional network debugging tools
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    curl \
    sudo \
    tcpdump \
    net-tools \
    netcat \
    procps \
    dnsutils \
    unzip \
    lsof
For this article, you only really need tcpdump but you’re already breaking the rules so might as well have all the tools you might need.
We have our microphone in place, but now we need to start the recorder as soon as the container starts. To do that, we will create a script on the target container that runs tcpdump alongside our primary process. For your convenience, here is just such a script, called run_tcpdump.sh. Remember to replace speedscale_loves_sres with the name of your actual process startup script.
#!/bin/bash

# Start tcpdump in the background first so it is already
# capturing when the primary process opens its first connection
sudo tcpdump -w startup.pcap &

# Give tcpdump a moment to attach to the network interface
sleep 1

# Run the primary process in the foreground; exec replaces this
# shell so the process receives container signals directly
exec /usr/local/bin/speedscale_loves_sres
Now copy this script into the final Docker container and set it as the entrypoint by modifying the Dockerfile like so:
COPY run_tcpdump.sh /usr/local/bin/run_tcpdump.sh
# # #
ENTRYPOINT [ "/usr/local/bin/run_tcpdump.sh" ]
Next, rebuild the image and restart the container so it comes up with the new entrypoint. If you’re running Kubernetes, the restart command might look something like this:
kubectl -n speedscale rollout restart deployment tcpdump_container
You now have a PCAP full of network traffic being written to the container’s local disk. Be careful: containers can be chatty, and you can run out of disk space very easily.
NOTE: If you run Kubernetes and have permissions to create a VolumeMount, you should do so. We find it’s easy to forget that you can attach storage quickly when you need it.
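One way to keep the capture from filling the disk is to check free space first and cap the number of packets tcpdump records. The following is a minimal sketch, not a drop-in replacement: the function names and the MIN_FREE_KB and MAX_PACKETS defaults are assumptions chosen for illustration.

```shell
#!/bin/bash
# Sketch: bound a startup capture so it cannot fill the container's disk.
# MIN_FREE_KB, MAX_PACKETS, and the function names are illustrative assumptions.

MIN_FREE_KB=${MIN_FREE_KB:-102400}   # require roughly 100 MB free before capturing
MAX_PACKETS=${MAX_PACKETS:-50000}    # tcpdump exits after this many packets

# Print the free space, in KB, of the filesystem that holds directory $1.
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed only if directory $1 has at least $2 KB free.
has_free_space() {
  [ "$(free_kb "$1")" -ge "$2" ]
}

# Start a packet-limited capture in the background, writing to file $1.
start_bounded_capture() {
  local out=$1
  if ! has_free_space "$(dirname "$out")" "$MIN_FREE_KB"; then
    echo "not enough free disk space, skipping capture" >&2
    return 1
  fi
  # -c stops the capture after MAX_PACKETS packets; -w writes raw packets to the file
  sudo tcpdump -c "$MAX_PACKETS" -w "$out" &
}
```

Calling `start_bounded_capture startup.pcap` could stand in for the plain tcpdump line in run_tcpdump.sh. tcpdump's -C and -W rotation options also limit disk use, but a rotating buffer eventually overwrites the earliest packets, which are exactly the ones this technique is after.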
Getting the PCAP file off the container varies wildly based on your container orchestration system… or lack of one. There’s a trick for Kubernetes users that makes copying the PCAP file to your local filesystem easy:
kubectl cp <namespace>/<pod>:<file> ~/Downloads/<file> -c tcpdump_container
Here’s a specific example of this command I used today:
kubectl cp motoshop/moto-api-7d85bf57f8-v4r2v:startup.pcap ~/Downloads/startup.pcap -c goproxy
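Pod names change on every restart, so it can be handy to look the pod up by label before copying. Here is a rough sketch of that idea; fetch_pcap, its arguments, and the label selector are hypothetical, and it simply takes the first pod that matches.

```shell
#!/bin/bash
# Sketch: look up a pod by label, then copy the capture out of it.
# fetch_pcap and its argument names are hypothetical; adjust for your cluster.

# Usage: fetch_pcap <namespace> <label-selector> <container> [remote-file] [local-dir]
fetch_pcap() {
  local ns=$1 selector=$2 container=$3
  local remote=${4:-startup.pcap}
  local dest=${5:-$HOME/Downloads}
  local pod

  # Take the first pod that matches the selector
  pod=$(kubectl -n "$ns" get pods -l "$selector" \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  [ -n "$pod" ] || { echo "no pod matches $selector" >&2; return 1; }

  # Same form as the kubectl cp command above: <namespace>/<pod>:<file>
  kubectl cp "$ns/$pod:$remote" "$dest/$pod-$(basename "$remote")" -c "$container"
}
```

For the example above, that might be `fetch_pcap motoshop app=moto-api goproxy`, assuming the pods carry an app=moto-api label. Keep in mind that kubectl cp relies on a tar binary being present in the container.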
This method is the happy path, but in heavily regulated environments extracting the PCAP can be surprisingly difficult. Our users have been forced to do everything from uploading the file to an AWS S3 bucket with the AWS CLI to running tshark on the container in real time.
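If you end up going the S3 route, the upload itself can be short. This is only a sketch under a few assumptions: the AWS CLI and working credentials are available in the debug container (the package list earlier does not install them), and upload_pcap and the bucket name are made up for illustration.

```shell
#!/bin/bash
# Sketch: compress a capture and push it to S3 from inside the container.
# upload_pcap and the bucket name are hypothetical; the AWS CLI and
# credentials are assumed to be present in the debug container.

# Usage: upload_pcap <pcap-file> <bucket>
upload_pcap() {
  local pcap=$1 bucket=$2
  local key
  # Prefix the object key with the hostname so uploads from different pods do not collide
  key="$(hostname)-$(basename "$pcap").gz"

  # Compress a copy of the capture to shorten the upload
  gzip -c "$pcap" > "$pcap.gz" || return 1
  aws s3 cp "$pcap.gz" "s3://$bucket/$key"
}
```

A call might look like `upload_pcap startup.pcap my-debug-bucket`, where my-debug-bucket is a placeholder for a bucket you are allowed to write to.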
NOTE: If you’re running Kubernetes, the kubectl cp (copy) command is incredibly useful for a variety of purposes, not just sneaking out PCAPs.
Don’t leave your debug container running. Your team members and CISO will thank you.
Wireshark is the de facto standard for open source network analysis GUIs. Much has been written about analyzing TCP dumps with Wireshark over the years. Here’s an easy getting-started post: Julia Evans: How I use Wireshark
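Before opening Wireshark, a quick pass on the command line can show whether a startup connection, such as the MongoDB ping described earlier, was attempted at all. A minimal sketch, assuming tcpdump is installed wherever you copied the file; list_connection_attempts is a hypothetical helper, and 27017 is MongoDB's default port.

```shell
#!/bin/bash
# Sketch: list the TCP connection attempts recorded in a capture file.
# list_connection_attempts is a hypothetical helper name.

# Usage: list_connection_attempts <pcap-file> [port]
list_connection_attempts() {
  local pcap=$1 port=${2:-}
  # Match SYN packets without ACK, i.e. the first packet of each new connection
  local filter='tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
  if [ -n "$port" ]; then
    filter="($filter) and port $port"
  fi
  # -nn skips hostname and port-name lookups; -r reads packets from a file
  tcpdump -nn -r "$pcap" "$filter"
}
```

Running `list_connection_attempts ~/Downloads/startup.pcap 27017` prints one line per connection attempt to that port. If nothing comes back, the ping may never have left the container, which is a useful clue before digging into the full trace.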
That’s it. As I said at the beginning of this post, this is an edge case, but it’s one we hit surprisingly frequently. Hope this helps.