15 years ago Google placed its largest server order and did something big: it started site reliability engineering
I read about this and wondered how impactful the large server order really was. I couldn't figure out what I would post about why the order was significant.
Then I ran into this post on Site Reliability Engineering, dated Apr 28, 2014, and realized that Google had a huge impact by starting the idea of a site reliability engineering team.
Here is one of the insights shared.
The solution that we have in SRE -- and it's worked extremely well -- is an error budget. An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything. Perhaps a pacemaker is a good exception! But, in general, for any software service or system you can think of, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and, let's say, 99.999% available. Because typically there are so many other things that sit in between the user and the software service that you're running that the marginal difference is lost in the noise of everything else that can go wrong.
If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? I propose that's a product question. It's not a technical question at all. It's a question of what will the users be happy with, given how much they're paying, whether it's direct or indirect, and what their alternatives are.
The business or the product must establish what the availability target is for the system. Once you've done that, one minus the availability target is what we call the error budget; if it's 99.99% available, that means that it's 0.01% unavailable. Now we are allowed to have .01% unavailability and this is a budget. We can spend it on anything we want, as long as we don't overspend it.
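To make the arithmetic in that quote concrete, here is a minimal sketch of turning an availability target into an error budget. The function names and the 30-day measurement window are my own assumptions for illustration; the quoted post does not say how Google measures the window or tracks the budget.

```python
from datetime import timedelta


def error_budget_fraction(availability_target: float) -> float:
    """Error budget is one minus the availability target (as fractions)."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Total unavailability the budget permits over the window."""
    return window * error_budget_fraction(availability_target)


def budget_remaining(
    availability_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Budget left to spend; a negative result means it is overspent."""
    return allowed_downtime(availability_target, window) - downtime_so_far


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed window, not from the quoted post
    target = 0.9999              # the 99.99% example from the quote
    print("allowed downtime:", allowed_downtime(target, window))
    print("remaining:", budget_remaining(target, window, timedelta(minutes=2)))
```

Under these assumptions, a 99.99% target over 30 days leaves roughly 4.3 minutes of unavailability to spend. A negative remaining value is the "overspent" case the quote warns about.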
Here is another rule that is good to think about when running operations.
One of the things we measure in the quarterly service reviews (discussed earlier), is what the environment of the SREs is like. Regardless of what they say, how happy they are, whether they like their development counterparts and so on, the key thing is to actually measure where their time is going. This is important for two reasons. One, because you want to detect as soon as possible when teams have gotten to the point where they're spending most of their time on operations work. You have to stop it at that point and correct it, because every Google service is growing, and, typically, they are all growing faster than the head count is growing. So anything that scales headcount linearly with the size of the service will fail. If you're spending most of your time on operations, that situation does not self-correct! You eventually get the crisis where you're now spending all of your time on operations and it's still not enough, and then the service either goes down or has another major problem.