Developers: were being told by management to build new features. Operators: were being told by management that everything had to be stable.
Developers are often better connected to the business. And what they’re doing has obviously a measurable effect. On the other hand, ops only got noticed when something broke (even though it might be the developers’ fault).
There was a thick wall between dev and ops.
Devops, in its purest definition, is breaking down the barrier between dev and ops. Slowly it started to extend, for instance to include security. DevSecOps and so.
There are five key areas to devops:
Reduce organisational silos. This is often doable. You could experiment with putting dev and ops in the same room: automatically, they’ll start cooperating.
Accept failure as normal. If you are not allowed to make mistakes, you won’t have much initiative. If you’re job is on the line whenever there is a change, you’ll most likely say “no” if someone requests a change. So: allow failures, that’s much better for your organisation.
Implement gradual change. Release often. Allow for easy rollbacks.
Leverage tooling and automation. Sometimes you hear people say “I installed ansible, so I’m using devops!”. No, that’s not it. The automation is a necessary tool for implementing the other four points.
Measure everything.
All five are abstract ideas. They don’t tell you how to do it. Devops defines the endresult, but doesn’t define how to do it.
SRE (site reliability engineering) is an implementation of those five devops ideas:
(I missed this one)
SRE embraces “blameless post-mortems”. And it allows for failures: they’re budgetted.
Small changes: yes. Ideally a database change is in a different change from the code changes.
Automate everything away that we’re doing manually now..
SRE is obsessed with measuring. Especially because of the SL-somethings.
SL-somethings?
SLI: service level indicator. Often observed by an external system.
SLO: service level objectives. Binding target for SLI.
SLA: service level agreement. Business agreement between a customer ans a provider.
Watch out with the target that you aim for. If your site has 99.9999999% of availability and your users access it through their cellphones…. The availability will look much less. So: don’t go overboard. A high availability costs serious monly: what is your availability budget?
If the availability budget is drained, you cannot deploy new features anymore until the budget is increased again.
SRE looks down upon toil: stuff that is manual, repetitive, automatable, tactical and devoid of long-term value. There’s a budget for reducing toil (and thus increasing automation).
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):