Many organizations are attempting to increase feature velocity and manage costs by adopting the practice of platform engineering (PE). Enterprises usually call Speedscale because they are struggling with software quality or development velocity. However, we’ve learned from experience that many of these problems stem from a shared reliance on outmoded processes and patterns. At its core, PE seeks to modernize software development to better match cloud-native applications.
What is platform engineering?
The term Platform Engineering (PE) continues to evolve, but fundamentally it encompasses the practice of building and maintaining the foundational infrastructure and systems that support business-critical applications. That usually means pursuing excellence in the domains of CI/CD, performance engineering, security, infrastructure, tooling and, most importantly for this discussion, developer experience. In many ways this is an evolution of the discipline of Site Reliability Engineering, DevOps or, even further back, systems administration.
Platform engineering vs. DevOps
While PE may include disciplines like DevOps, it is a much more expansive domain with a different approach. The differences seem subtle but lead to very different outcomes. In practice, DevOps focuses mainly on applications once they hit production, which means concentrating on API observability and security with little time left over for developer experience. At its root, API observability makes applications more… observable. That means improving the quality of application telemetry like logs and metrics. That’s a great start, but my observation is that PE extends the definition of “observable” to include experiments designed to expose the behavior of the system. DevOps is mostly passive automation, while PE is active experimentation.
Principle 1: Feedback loops vs automation
The existence of strict human-driven software testing processes should be viewed as an admission of failure. Abide by this maxim and many things in the cloud native world will get easier. Stated more broadly, PE focuses on feedback loops connecting the human to the machine (or machine intelligence) rather than just automation. In the early days of DevOps it was a major efficiency improvement to simply automate deployment using Jenkins. PE, however, takes this practice further than automated deployments by creating a system that developers can directly experiment with. The difference is subtle but the results can be enormous.
For example, let’s consider three feedback loops and their consequences:
1.) Test in prod with fast rollback, feature flags and limited blast radius
This gives you a fast feedback loop, but at the cost of using your customers as crash test dummies. In some situations that’s ok, like with a broken content feed that the user can simply refresh. It’s less acceptable when the feature is oriented around guaranteed delivery, like a bill pay service.
The other big disadvantage of this approach is that it requires highly skilled engineers to make intricate decisions about infrastructure design and feature scope to limit blast radius.
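The limited-blast-radius pattern above is usually implemented with percentage-based feature flags: each user is deterministically assigned a bucket, and only a small slice sees the new code path. Here is a minimal sketch of that idea; the flag name, user ID format and rollout percentage are hypothetical, not any particular flag service’s API.

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) for a given flag.

    Hashing flag and user together keeps buckets independent across
    flags, so one flag's rollout doesn't correlate with another's.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Expose the (hypothetical) new bill-pay flow to 5% of users; everyone
# else keeps the old code path, limiting the blast radius of a bad release.
if is_enabled("new-billpay-flow", "user-42", rollout_percent=5):
    pass  # new code path
else:
    pass  # stable code path
```

Because the bucketing is deterministic, a user who sees the new flow keeps seeing it as the rollout percentage grows, which makes rollback a matter of setting the percentage back to zero.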
2.) Microservices architectures
When microservices became popular, they solved a key problem: they reduced the scope of what each engineering team needed to know. The smaller the immediate codebase, the faster the feedback loop. However, the complexity didn’t go away; it simply moved. Issues now manifest only in integrated environments like staging, because the problems shift to the interactions between components. Said differently, debugging means tracing a large distributed system instead of tracing code in a monolith.
3.) Traditional performance testing and regression testing
Having a formalized software testing process prevents many kinds of errors from escaping to production. That’s pretty much the only upside. The downsides are the expense, the fragility of the process and the tax on development velocity. For these reasons, most organizations are abandoning manual testing.
The focus of PE should be reducing the cost of experimentation while managing tradeoffs. It’s not enough to just automate.
Principle 2: Platform includes the developer desktop
Every engineer knows the value of running the entire system on their laptop. The ability to tinker, rewire and refactor without breaking a real system is invaluable for increasing velocity. That may not be possible for many applications but the concept of giving developers their own sandbox is still crucial. For this reason, platform engineering expands the definition of “platform” to include the dev environment and tooling.
DevOps focuses mainly on automating the delivery of the production application and its infrastructure, while PE focuses on automating the delivery of the developer test environments as well. Most organizations implement one of the following patterns:
- Packaging the application including test and mock data so that it can be simulated on a laptop
- On-demand preview environments, typically built with each merge request and managed by tools like Argo and Flux
- Ephemeral service isolation test environments either local or in a cloud provider like Speedscale
- Realistic centralized test environments with traffic rerouting like Telepresence
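To make the on-demand preview pattern concrete, many teams derive an isolated Kubernetes namespace per merge request and tear it down when the request closes. The sketch below shows one plausible naming scheme; the function name, repo slug and kubectl invocations are illustrative assumptions, not the API of any of the tools listed above.

```python
import re

def preview_namespace(repo: str, mr_id: int) -> str:
    """Derive a DNS-1123-safe namespace name for a merge request's
    ephemeral preview environment (Kubernetes caps labels at 63 chars)."""
    slug = re.sub(r"[^a-z0-9-]", "-", repo.lower()).strip("-")
    return f"preview-{slug}-mr{mr_id}"[:63].rstrip("-")

# A CI job might create the environment when the merge request opens...
ns = preview_namespace("Payments_API", 1374)
create_cmd = f"kubectl create namespace {ns}"
# ...and delete it when the merge request closes, keeping it ephemeral.
delete_cmd = f"kubectl delete namespace {ns}"
```

The important property is that the name is a pure function of the merge request, so the create and delete steps of the pipeline always agree on which environment they own.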
Generally these systems are managed by the PE team as a service to the broader engineering team. They need to be managed and designed along with the application.
Principle 3: Everything is ephemeral
Most DevOps practitioners are familiar with the idea of Infrastructure as Code (IaC), where an entire application environment can be reproduced with the press of a button. This is an excellent start, but Kubernetes takes it further by introducing the idea that everything is short lived. Instead of carefully crafting virtual machines and software-defined networking rules, we now design systems around short-lived containers with elastic scaling.
This shift accelerates feature delivery in a variety of ways, from testing to rollbacks. Here are a few specific applications of this idea and its advantages:
1.) Data portability
If your data is stored as JSON files in the cloud, it can be moved and repurposed easily for different use cases. Backups are simple because it’s just copying files. Analytics are easy because you can just ask Athena, BigQuery, Snowflake, etc. to traverse it. Machine learning training becomes easier because the dataset can be segmented and passed around. Compare this with the relational databases of yore, with their proprietary formats and backup systems.
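Here is a small sketch of the portability point: when each record is a self-describing line of JSON, the same text can be copied for backup, queried ad hoc, or sliced into a training set with no proprietary tooling. The field names are invented for illustration.

```python
import io
import json

# Write events as JSON lines -- one self-describing record per line.
records = [
    {"user": "u1", "action": "login", "ms": 120},
    {"user": "u2", "action": "login", "ms": 340},
    {"user": "u1", "action": "pay",   "ms": 95},
]
buf = io.StringIO()  # stands in for a file in cloud object storage
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# "Backup" is just copying the text. Analytics is just reading it back:
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
logins = [r for r in parsed if r["action"] == "login"]
avg_login_ms = sum(r["ms"] for r in logins) / len(logins)
```

A query engine like Athena does essentially this scan at scale; nothing about the format locks you into one vendor’s backup or export tooling.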
2.) Ephemeral tests
How can stable tests be written when the APIs and applications are constantly changing? Stop trying and utilize ephemeral traffic replay instead. The tests and mocks are always refreshed from real user behavior.
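The record-and-replay idea can be sketched in a few lines: capture real request/response pairs, then serve the responses back as a mock so tests always reflect current behavior instead of hand-maintained fixtures. This is a conceptual sketch of the pattern, not Speedscale’s actual implementation; the endpoints and payloads are invented.

```python
# Record phase: observe real traffic and store request -> response pairs.
recorded: dict[tuple[str, str], dict] = {}

def record(method: str, path: str, response: dict) -> None:
    recorded[(method, path)] = response

record("GET", "/v1/users/42", {"status": 200, "body": {"id": 42, "name": "Ada"}})
record("POST", "/v1/payments", {"status": 201, "body": {"ok": True}})

# Replay phase: the mock answers from the recording, so tests stay in
# sync with real user behavior. Re-recording refreshes the whole suite.
def mock_backend(method: str, path: str) -> dict:
    try:
        return recorded[(method, path)]
    except KeyError:
        return {"status": 404, "body": {"error": "not recorded"}}
```

Because the recording is regenerated from live traffic, the mock is as ephemeral as the environments it runs in, which is exactly what makes the tests stable against a constantly changing API.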
3.) Engineering Velocity
Some organizations stand up complete application preview environments for every merge request. These environments may live for an hour or less but they let reviewers interact with the running application and see the code in action. When they’re done, the environment disappears.
4.) Disaster recovery
It’s harder to get taken in by ransomware if you can press a button, rebuild your systems from code and restore your data.
The easiest way to get started with this concept is to convert your infrastructure to code with a tool like Terraform or CloudFormation. As you progress it becomes necessary to shift to a modern container management system like Kubernetes. Some organizations invest in portals for full deployment automation.
🚫 DON’T treat PE as a rebranding of DevOps
✅ DO learn more about these three key principles and how they apply to your technology stack
🚫 DON’T leave the developer experience as an afterthought
✅ DO design your application platform so it can be scaled up and down
🚫 DON’T treat development and deployment as a linear process
✅ DO identify feedback loops and reduce experimentation effort
🚫 DON’T create static processes and artifacts like virtual machines or testing plans
✅ DO create always-up-to-date feedback loops