Read in PT-BR

The Incident

This week I participated in a conversation about Error Budget on platform X (formerly known as Twitter). However, when I tried to send my message, nothing happened. I tried again, without success. When I opened another tab, I noticed that the social network was not loading. At that moment, I thought it was an unavailability of X, something that has been criticized since its recent acquisition. An hour later, I began to receive complaints from colleagues with the same problem, clearly upset about the unavailability of the social network.

Interestingly, Twitter became unavailable just as we were discussing Error Budget.

As a good SRE, I began to investigate, but quickly realized that the problem was not limited to Twitter. After about an hour, it became clear that several services were unavailable. This raised speculations: “Is internet provider Y having problems?”. At that moment, WhatsApp groups were already discussing the issue and tagging me, questioning what was happening with the country’s internet.

It didn’t take long for the supposed culprit to be identified: a problem at an Interconnection Point (PIX) operated by ix.br (a national project that promotes direct interconnection between Brazilian networks). A small notice on their status page informed those interested about the situation (screenshot follows). However, for most users, this made no difference: the service they wanted to use was unavailable for a few hours.

ix.br status

At that moment, all I could imagine was a Software Engineer having to explain to stakeholders that the unavailability was not caused by a problem in their service, but rather by a problem in a service over which they had no control.

In this incident, frustration was inevitable, both for the operations team and for the end users. In situations like this, where systems dependent on external infrastructures face interruptions, reality becomes particularly challenging. For the user, the distinction between an internal failure and a problem in an outsourced service is irrelevant; what matters is that access to the desired service is blocked. This perception amplifies the frustration of the operations team, because, despite efforts to keep everything working perfectly, external factors can subvert these efforts and directly impact the user experience.

Unavailability is Inevitable

Such an event made me reflect on the nature of reliability and recall some important lessons about accepting risk as an SRE, like this excerpt from the book Site Reliability Engineering:

“Extreme reliability has a cost: maximizing stability limits the speed at which new features can be developed and how products can be delivered to users, and drastically increases its cost, which in turn reduces the number of resources a team can offer. In addition, users often do not notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components, such as the cellular network or the device they are working with.”

Several factors can cause or create a perception of unavailability. For our services to function properly, they depend on a series of other services, some we can mitigate and think about recovery methods, others not.

For this reason, thinking about extreme reliability can be a mistake for most cases. After all, the user experience is dominated by less reliable components and there will eventually be a perception of unavailability.

Several outages across the internet

So, whether it’s something within our control or not, the unavailability of a service is inevitable and will happen. The big question is: how do we deal with it?

This acceptance does not imply negligence, but rather a change in perspective: instead of focusing solely on prevention, we, as operators of the SRE culture, can focus on efficient management and rapid recovery from incidents.

A realistic perspective on reliability and risk

The fear of unavailability is a constant reality for system operators, but the SRE framework offers tools like Error Budget and Service Level Objectives (SLOs) to manage this challenge. These tools allow for a more realistic approach to reliability, helping to quantify acceptable risk and balance the need for innovation with maintaining stability.

The practice of SRE promotes a culture that balances risk with innovation, encouraging experimentation within clear and safe limits. This approach has been fundamental for the adoption of previously considered risky processes, such as Trunk Based Development, Continuous Deployment, and Continuous Integration.

The Error Budget, in particular, is a valuable tool that supports responsible innovation and data-based decision-making. It provides an objective metric, based on business indicators, to determine risk tolerance, whether in terms of permitted downtime or the acceptable number of errors.

Accepting the inevitability of unavailability and using the Error Budget strategically is essential for effective system management. This does not imply a renunciation of reliability, but rather recognizing that truly reliable systems are those prepared to respond and recover from failures quickly and efficiently.

Conclusion - Embracing Unavailability as a Reality

The incident experienced by a portion of Internet users in Brazil this week serves as a practical reminder that, even with the best plans and robust systems, external factors and unforeseen events can affect the user experience and the perception of unavailability. This event underscores the fundamental importance, in the discipline of SRE (Site Reliability Engineering), of understanding and accepting that this risk is inevitable.

Just like the unavailability faced, where external factors left us dependent on elements outside our control, the Error Budget helps us to understand and accept that not all risks are avoidable. Instead of seeking unattainable perfection, we should focus on creating resilient and adaptable systems, capable of evolving and recovering in the face of challenges.

For this, we need to be willing to accept the risk and treat system operation with a more realistic approach.

Despite recognizing the exceptionality of the unavailability mentioned, the situation reinforces that true reliability is not based on the absence of failures, but on the ability to manage them efficiently. By accepting and intelligently managing risk, we keep our operating systems up and prepare them for a future of continuous innovation and sustainable growth.