The Trap of Immediate Improvement in Software Operations Practices

The Illusion of Immediate Improvement

A caveat before starting: some of the generalizations in this article are just that - generalizations - and they do not and could not describe the reality of an entire sector. The same warning applies to the hyperbole used in the text.

The consolidation of software operations practices, such as DevOps, SRE, and Platform Engineering, has established a clearer path of convergence for software tools and architectures.

This convergence seems to have created a finish line or target for many organizations, but for teams that haven’t yet reached a certain degree of technological maturity, it can create the illusion of immediate improvement.

Immediate improvement is the idea that a specific action or tool will, beyond any doubt, produce instant improvement. It’s as if there were a direct way to solve the wiring problem of the pole pictured below while ignoring the circumstances that created it.

Image of a power pole with disorganized wiring structures referencing complexity

Continuous Improvement Can Intimidate While Immediate Improvement Can Attract

Despite all the advances in software engineering in recent decades, the idea and process of continuous improvement can still intimidate many organizations. The growth in technological maturity and the cultural gains of teams, both of which result from continuous improvement, end up being seen as obstacles to delivery in certain scenarios.

Many software professionals overlook the fact that paradigms like container orchestration, cloud, and CI/CD were developed over time, through trial and error. These paradigms are the fruit of learning how to improve a complex scenario through new abstractions.

In this scenario of abstractions, organizations can be misled into believing that certain tools or paths taken by other organizations are fundamental to their own success, even when they lack the complexity that made those choices necessary.

As a result, immediate improvement ends up being a much more attractive strategy for teams, as it doesn’t require a change in culture or processes. It’s the difference between a marathon and a sprint: although the mechanical action is basically the same, the process is completely different.

Being DevOps Is Not Just Having a Deployment Pipeline

Meme about DevOps

This point may seem trivial to those familiar with it, but it is one of the most common illusions of immediate improvement. Having a deployment pipeline is not enough to call a team DevOps.

Deployment pipelines were being explored even before the term DevOps was consolidated. The well-known book Continuous Delivery (Humble and Farley) already laid out, in 2010, principles that would enter the DevOps culture playbook, such as automation, frequent testing, collaboration between teams, and accountability for processes.

Even so, it’s common to find the anti-pattern of organizations that have deployment pipelines but zero collaboration between teams, and that still need someone to manually watch automated deployment runs.

This scenario, in Humble and Farley’s work, would not be considered Continuous Delivery and shows weak adherence to DevOps culture. Without proper technological maturity, automations not only fail to remove the risk of errors but introduce new types of risks.
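As a rough sketch of what replacing that manual watch could look like, the Python snippet below gates a release on automated checks. The function name verify_release, the retry count, and the "promote"/"rollback" outcomes are assumptions made for illustration; they do not come from Humble and Farley’s book or from any specific CI tool.

```python
from typing import Callable, Sequence


def verify_release(checks: Sequence[Callable[[], bool]], attempts: int = 3) -> str:
    """Decide what to do with a fresh deployment (hypothetical helper).

    Each check is a zero-argument callable returning True when healthy.
    Returns "promote" only if every check passes within `attempts` tries,
    otherwise "rollback".
    """
    if not checks:
        # A gate with no checks verifies nothing, so refuse to run.
        raise ValueError("at least one check is required")

    for check in checks:
        passed = False
        for _ in range(attempts):
            try:
                if check():
                    passed = True
                    break
            except Exception:
                # A crashing probe counts as a failed attempt,
                # not as a crashed pipeline.
                pass
        if not passed:
            return "rollback"
    return "promote"
```

The code itself is trivial. What takes maturity is agreeing on which checks define a healthy release and who owns them, and no pipeline tool can settle that on a team’s behalf.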

Therefore, being DevOps requires a cultural transformation that comes with the technological maturity built by continuous improvement: understanding flows, documenting automations, and creating a culture of accountability for processes. This is difficult to achieve immediately.

Being in the Cloud Is Not Being Cloud Native

Meme about Cloud Native

Here is another point that seems obvious but becomes a source of tension for many organizations - especially those that have recently migrated to the cloud or are in the process of migrating.

Being in the cloud is relatively easy; I consider it even easier than sustaining your own infrastructure these days. Migrating to the cloud requires some maneuvering, but even then, getting services running is relatively simple with minimal process maturity.

Despite this, taking advantage of the benefits that are usually sold by a Cloud provider (scalability, reliability, security, and others) usually requires a certain adaptation. Such adaptation is quite resistant to the illusion of immediate improvement, as it tends to be a very iterative process and not an immediate paradigm shift.

Taking a monolith and throwing it into the cloud in a container with the expectation of being Cloud Native with all the bells and whistles is one of the main illusions of immediate improvement. Software is not born Cloud Native and planning is necessary for this change.

Observability Is Not Creating Dashboards for the Platform

Since observability engineering began to gain strength in community discussions, mainly among SRE practitioners, the technical and cultural challenge of observability processes has become evident. Of all the items mentioned so far, this is perhaps the one that demands the most cultural maturity from an organization.

Monitoring a platform, to a certain extent, is relatively simple, given that most Cloud providers already have some basic integrations for primitive types of monitoring.

However, observability engineering arose precisely because monitoring alone was not enough for complex contexts. The illusion of immediate improvement through dashboards is one of the main blind spots of observability engineering. It’s not uncommon to find platforms experiencing problems while the dashboards show nothing wrong and incidents are reported by the users.

Observability engineering requires continuous learning, as well as a minimum degree of collaboration between teams so that infrastructure and software can be adapted to a higher level of observability.
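One way to narrow that gap is to notify on a symptom users actually feel, rather than waiting for someone to look at a chart. The sketch below is illustrative only: the Window type, the 1% threshold, and the minimum request count are hypothetical values, not recommendations from any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one evaluation window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def should_alert(
    window: Window,
    max_error_ratio: float = 0.01,  # assumed threshold, tune per service
    min_requests: int = 100,        # assumed floor to avoid noise on low traffic
) -> bool:
    """Return True when the user-facing error ratio exceeds the threshold."""
    if window.total_requests < min_requests:
        # Too little traffic to judge; a handful of failures would
        # otherwise page someone over statistical noise.
        return False
    error_ratio = window.failed_requests / window.total_requests
    return error_ratio > max_error_ratio
```

Even a rule this small depends on decisions that take iteration: what counts as a failed request, which threshold reflects real user pain, and who responds when it fires.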

Observability, so far, is not plug-and-play and needs to go through the continuous improvement process to achieve the desired objectives.

The Argonaut’s Poison

Meme about Kubernetes and complexity

Something that immediate improvement usually hides is the need to review ideas and processes. Software engineering is also a victim of hype cycles, but continuous improvement imposes the obligation to take two steps back when a team notices it has gone in the wrong direction or taken a bigger step than intended.

One of the tools going through its hype cycle in the Ops world is Kubernetes, a fantastic container orchestration platform inspired by Borg - Google’s internal project to solve its own operations problems. It’s undeniably one of the greatest engineering feats in the operations world in recent decades, but it’s definitely not the master key for all platforms.

Organizations that never considered clustering started adopting the tool in the expectation that it would lead to the path of success in Operations, another face of immediate improvement. Kubernetes has fascinating use cases and many success stories of operations simplification, but it requires a series of precautions.

Some organizations make only extremely basic use of Kubernetes - which in itself raises the question of whether they need the tool at all - while others may need even encryption on the cluster’s internal network. No single size or color of cluster will solve the problems of every platform.

On more than one occasion I have recommended decommissioning a Kubernetes cluster because it was introducing more technical debt than value to the platform. I believe I will continue to do so in the future.

Believing that the creation and use of a cluster, without a continuous improvement process, will immediately solve technical and cultural problems is one of the biggest traps regarding k8s.

Some cases that share similar reflections: Why We Don’t Use Kubernetes, Kubernetes Wasn’t a Good Fit for Us and Who Left Kubernetes and Why.

The difference between medicine and poison is the dosage.

The Importance of Technological and Cultural Maturity

The pursuit of improvement, without the respective technological and cultural maturity, can lead to a significant increase in complexity. This is because the illusion that cutting-edge tools and practices can be adopted without a gradual process of adaptation and learning ends up contributing to the emergence of problems and incidents.

Therefore, the improvement of platforms rarely - if ever - comes from shortcut solutions, but from a continuous commitment to technological and cultural evolution. Such commitment is something built daily in teams and platforms, through an iterative process of learning, adaptation, and refinement.

The weakest link in a platform is the human being who builds it, as it’s one of the few parts we can’t define with code. Yet.