Many engineering organizations have recently begun adopting platform engineering as a way to ship new features faster while managing costs. This approach to software development emerged in response to the growing complexity of modern architectures. Its primary focus is to modernize software engineering practices to better fit cloud-native applications.
At first glance, it sounds a lot like a familiar concept: DevOps. While the two share similar goals, we’ve identified three main principles that set them apart.
What is platform engineering?
The term Platform Engineering (PE) continues to evolve, but fundamentally it encompasses the practice of building and maintaining the foundational infrastructure and systems that support business-critical applications.
That usually means pursuing excellence in continuous integration and continuous delivery, performance engineering, security, infrastructure, tooling, and, most importantly for this discussion, developer experience.
Platform engineering teams are responsible for building and maintaining an internal developer platform (IDP). These internal platforms are designed to improve developer self-service, reduce cognitive load, and automate certain repetitive tasks, boosting developer productivity and accelerating the software development lifecycle.
In many ways, this is an evolution of the discipline of Site Reliability Engineering, DevOps, and even systems administration.
While PE may include disciplines like DevOps, it is a much broader domain with a different approach. The differences seem subtle, but they lead to very different outcomes. In practice, DevOps engineers mainly focus on applications once they hit production. That means concentrating on API observability and security, with little time left over for developer experience. At its root, API observability makes applications more… observable: it improves the quality of application telemetry such as logs and metrics. That’s a great start, but in my observation PE extends the definition of “observable” to include experiments designed to expose the behavior of the system. DevOps processes focus on passive automation, while PE is about active experimentation.
Principle 1: Feedback loops vs automation
The existence of strict human-driven software testing processes should be viewed as an admission of failure. Abide by this maxim and many things in the cloud-native world will get easier. Stated more broadly, platform engineering focuses on feedback loops that connect the human to the machine (or machine intelligence) rather than on automation alone. In the early days of DevOps, simply automating deployment with Jenkins was a major efficiency improvement. PE takes this practice further by creating a system that developers can experiment with directly. The difference is subtle, but the results can be enormous.
For example, let’s consider three feedback loops and their consequences:
1) Test in prod with fast rollback, feature flags, and limited blast radius
Testing in production gives you a fast feedback loop, but it can come at a great cost: your customers become your crash test dummies. That can be acceptable in some situations, such as a content feed the user can simply refresh. It is far less acceptable for something that requires guaranteed delivery, like a bill pay service.
The other big disadvantage of this approach is that it requires highly skilled software developers to make intricate decisions about infrastructure design and feature scope to limit the blast radius.
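To make the blast-radius idea concrete, here is a minimal sketch of a percentage-based feature flag. It assumes a hypothetical in-process flag table (`ROLLOUT_PERCENT`) and a hypothetical `checkout` code path; real feature flag services add targeting rules, auditing, and remote configuration, none of which is modeled here. Each user is hashed into a stable bucket, so a new path can be exposed to a small slice of users and rolled back by setting the percentage to zero.

```python
import hashlib

# Hypothetical in-process flag table: flag name -> percentage of users (0-100)
# who see the new code path. A real system would load this from a flag service.
ROLLOUT_PERCENT = {"new_checkout": 5}


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str) -> bool:
    """True if this user falls inside the flag's rollout percentage.

    Unknown flags default to 0%, i.e. disabled (fail closed).
    """
    return bucket(flag, user_id) < ROLLOUT_PERCENT.get(flag, 0)


def checkout(user_id: str) -> str:
    # Hypothetical code path: only users in the rollout bucket hit the new
    # implementation; everyone else keeps the known-good behavior.
    if is_enabled("new_checkout", user_id):
        return "new"
    return "legacy"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(checkout(u) == "new" for u in users)
    print(f"{exposed} of {len(users)} users on the new path")

    # "Fast rollback": dropping the percentage to 0 disables the path for everyone.
    ROLLOUT_PERCENT["new_checkout"] = 0
    assert all(checkout(u) == "legacy" for u in users)
```

Hashing rather than random sampling keeps each user's experience consistent across requests, which makes problems reported by the exposed group easier to reproduce.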
2) Microservices architectures
When microservices became popular, they solved a key problem by reducing the scope of what each engineering team needed to know. The smaller the immediate codebase, the faster the feedback loop. However, the complexity didn’t go away; it simply moved. Many issues now surface only in the staging environment, because the problems have moved to the interactions between components. Put differently, debugging means tracing behavior across a large system instead of tracing code in a monolith.
3) Traditional performance testing and regression testing
Having a formalized software testing process prevents many kinds of errors from escaping to production. That’s pretty much the only upside. The downsides are the expense, the fragility of the process, and the tax on development velocity. Many organizations are abandoning manual testing.
The focus of Platform Engineering should be reducing the cost of experimentation while managing tradeoffs. It’s not enough to just automate.
Principle 2: Platform includes the developer desktop
Every engineer knows the value of running the entire system on their laptop. The ability to tinker, rewire, and refactor without breaking a real system is invaluable for increasing velocity. That may not be possible for many applications, but giving developers their own sandbox is still crucial. For this reason, platform engineering expands the definition of “platform” to include the dev environment and tooling.
DevOps focuses mainly on automating the delivery of the production application and its infrastructure, while PE also automates the delivery of developer test environments. Most organizations implement one of the following patterns:
- Packaging the application, including test and mock data, so that it can be simulated on a laptop
- On-demand preview environments, typically built for each merge request and managed by tools like Argo and Flux
- Ephemeral, isolated service test environments, either local or in the cloud, using tools like Speedscale
- Realistic centralized test environments with traffic rerouting, using tools like Telepresence
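The first pattern, packaging mock data with the application, can be as simple as serving recorded responses in place of a real dependency. The sketch below is hypothetical: it assumes recorded calls are stored as JSON with `method`, `path`, `status`, and `body` fields (a made-up format, not any particular tool's schema), and the `RecordedMock` class is illustrative only. It lets a service run on a laptop without the real downstream system.

```python
import json
from dataclasses import dataclass

# Hypothetical fixture format: a JSON list of recorded calls to a downstream API.
RECORDED = json.loads("""
[
  {"method": "GET",  "path": "/accounts/42", "status": 200, "body": {"id": 42, "balance": 100}},
  {"method": "POST", "path": "/payments",    "status": 201, "body": {"payment_id": "p-1"}}
]
""")


@dataclass
class Response:
    status: int
    body: dict


class RecordedMock:
    """Stands in for a real dependency by returning recorded responses."""

    def __init__(self, recordings: list[dict]):
        # Index recordings by (method, path) for constant-time lookup.
        self._routes = {
            (r["method"], r["path"]): Response(r["status"], r["body"])
            for r in recordings
        }

    def request(self, method: str, path: str) -> Response:
        # Unrecorded calls return 404 so gaps in the fixtures are visible
        # instead of silently succeeding.
        return self._routes.get(
            (method.upper(), path), Response(404, {"error": "not recorded"})
        )


if __name__ == "__main__":
    accounts_api = RecordedMock(RECORDED)
    print(accounts_api.request("GET", "/accounts/42"))
    print(accounts_api.request("GET", "/accounts/7"))
```

Returning an explicit 404 for unrecorded calls is a deliberate choice: it tells the developer the fixtures need refreshing rather than hiding the gap.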
Generally, these systems are managed by the platform engineering team as a service to the broader engineering organization, and they need to be designed and maintained alongside the application.
Principle 3: Everything is ephemeral
Most DevOps practitioners are familiar with Infrastructure as Code (IaC), where an entire application environment can be reproduced at the press of a button. This is an excellent start, but Kubernetes takes it further by embracing the idea that everything is short-lived. Instead of carefully crafting virtual machines and software-defined networking rules, current best practice is to design systems around short-lived containers with elastic scaling.
This shift accelerates development processes in a variety of ways, from testing to rollbacks. Here are a few specific applications of this idea and its advantages:
1) Data portability
If your data is stored as JSON files in the cloud, it can be moved and repurposed easily for different use cases. Backups are simple because they’re just file copies. Analytics are easy because you can point Athena, BigQuery, Snowflake, or a similar engine at the data and query it. Machine learning training becomes easier because the dataset can be segmented and passed around. Compare this with the relational databases of yore, with their proprietary formats and backup systems.
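As a small illustration of why plain JSON files are portable, the sketch below writes hypothetical payment records as JSON Lines (one object per line) to a local temporary directory standing in for cloud storage, backs them up with an ordinary file copy, and splits them for a training run. The file names and record fields are made up for the example.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Hypothetical records; in practice these would live in object storage.
records = [{"user": f"u{i}", "amount": i * 10} for i in range(6)]

# A local temp directory stands in for a cloud bucket in this sketch.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "payments.jsonl"

# Store one JSON object per line: plain text that needs no proprietary driver.
with data_file.open("w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Backup is just a file copy.
backup_file = workdir / "payments.backup.jsonl"
shutil.copyfile(data_file, backup_file)

# Segmenting the dataset (e.g. a train/test split) is just reading lines.
loaded = [json.loads(line) for line in backup_file.read_text().splitlines()]
train, test = loaded[:4], loaded[4:]
print(len(train), len(test))
```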
2) Testing
How can stable tests be written when APIs and applications are constantly changing? Stop trying, and use ephemeral traffic replay instead: the tests and mocks are always refreshed from real user behavior.
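As a rough illustration of the replay idea, the sketch below replays recorded requests against a new version of a handler and reports responses that differ from what production returned. The recording format and `handle_v2` are hypothetical stand-ins, and this is not how any specific product works internally; real replay tools capture traffic at the network layer and must cope with non-deterministic fields such as timestamps, which this sketch handles only with a simple ignore list.

```python
from typing import Callable

# Hypothetical recording: each entry is a request seen in production plus
# the response production returned for it.
RECORDED_TRAFFIC = [
    {"request": {"path": "/price", "qty": 2}, "response": {"total": 20, "ts": "t1"}},
    {"request": {"path": "/price", "qty": 5}, "response": {"total": 50, "ts": "t2"}},
]


def handle_v2(request: dict) -> dict:
    # Hypothetical candidate version under test: adds a bulk discount.
    qty = request["qty"]
    total = qty * 10 if qty < 5 else qty * 9
    return {"total": total, "ts": "now"}


def replay(traffic: list[dict], handler: Callable[[dict], dict],
           ignore: frozenset[str] = frozenset()) -> list[dict]:
    """Replay recorded requests and return the ones whose responses changed.

    Keys listed in `ignore` (e.g. timestamps) are excluded from the comparison.
    """
    diffs = []
    for entry in traffic:
        expected = {k: v for k, v in entry["response"].items() if k not in ignore}
        actual = {k: v for k, v in handler(entry["request"]).items() if k not in ignore}
        if actual != expected:
            diffs.append({"request": entry["request"], "expected": expected, "actual": actual})
    return diffs


if __name__ == "__main__":
    for d in replay(RECORDED_TRAFFIC, handle_v2, ignore=frozenset({"ts"})):
        print(d)
```

Because the "expected" side comes from recorded traffic rather than hand-written assertions, refreshing the recording refreshes the test; the engineer's job shifts to deciding whether each reported difference is a bug or an intended change.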
3) Engineering velocity
Some organizations stand up complete application preview environments for every merge request. These environments may live for an hour or less but they let reviewers interact with the running application and see the code in action. When they’re done, the environment disappears.
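Because preview environments are meant to be short-lived, something has to clean them up; a time-to-live (TTL) is one common approach. The minimal sketch below assumes a hypothetical `PreviewEnv` record holding a creation time, a TTL, and whether the merge request is still open. It only decides which environments to tear down and leaves the actual deletion (for example, removing a Kubernetes namespace) to whatever tooling you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PreviewEnv:
    # Hypothetical record kept for each merge request's preview environment.
    name: str
    created_at: datetime
    ttl: timedelta
    mr_open: bool


def envs_to_delete(envs: list[PreviewEnv], now: datetime) -> list[str]:
    """Return names of environments whose TTL has expired or whose MR is closed."""
    return [
        e.name
        for e in envs
        if not e.mr_open or now >= e.created_at + e.ttl
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    envs = [
        PreviewEnv("mr-101", now - timedelta(minutes=20), timedelta(hours=1), True),
        PreviewEnv("mr-102", now - timedelta(hours=2), timedelta(hours=1), True),
        PreviewEnv("mr-103", now - timedelta(minutes=5), timedelta(hours=1), False),
    ]
    print(envs_to_delete(envs, now))  # → ['mr-102', 'mr-103']
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the cleanup logic deterministic and easy to test.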
4) Security
Ransomware is much less of a threat if you can press a button to rebuild your systems from code and restore your data.
The easiest way to get started is to define your infrastructure as code with a tool like Terraform or CloudFormation. As you progress, it becomes necessary to move to a modern container orchestration system like Kubernetes. Some organizations go further and invest in portals for full deployment automation.
Key takeaways
🚫 DON’T treat PE as a rebranding of DevOps
✅ DO learn more about these three key principles and how they apply to your technology stack
🚫 DON’T leave the developer experience as an afterthought
✅ DO design your application platform so it can be scaled up and down
🚫 DON’T treat development and deployment as a linear process
✅ DO identify feedback loops and reduce experimentation effort
🚫 DON’T create static processes and artifacts like virtual machines or testing plans
✅ DO create always-up-to-date feedback loops
Build your Internal Developer Platform (IDP) with Speedscale
By recording production traffic and replaying it in test environments, Speedscale helps platform engineers build their IDP with realistic data and mocks. Learn more about production traffic replication in our Definitive Guide to Production Traffic Replication and Replay or sign up below to try us out for free.