Devopsdays 2019: observability for emerging infra - Charity Majors

Tags: devopsdays

She had some comments at the start. “The only good diff is a red diff”. “Junior engineers ship features, senior engineers kill features.” :-)

There are three main elements of observability: metrics/monitoring, logs, tracing. Logs are strings, mostly, so mostly useless. Monitoring is lots of data, but it is mostly out of context.

“What got you here won’t get you there”. She mostly means developing and deploying code. Releasing and deploying used to be “flipping a switch”. Now it starts to be more continual. Many more releases (and some rollbacks). And “getting it into production” should be embedded throughout the process. It should move earlier right into the process of writing the code.

What are the markers of a healthy engineering team? A good measurement is “how quickly does your code end up in production?”
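To make that measurement concrete, here is a minimal sketch (my own, not something shown in the talk) that computes the median time between a commit and its arrival in production. The timestamps and the name `median_lead_time_minutes` are made up for illustration; in practice you would pull these pairs from your version control and deploy tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical (commit time, production deploy time) pairs for recent changes.
changes = [
    (datetime(2019, 6, 3, 9, 0), datetime(2019, 6, 3, 9, 45)),
    (datetime(2019, 6, 3, 13, 0), datetime(2019, 6, 4, 10, 0)),
    (datetime(2019, 6, 5, 11, 0), datetime(2019, 6, 5, 11, 20)),
]


def median_lead_time_minutes(changes):
    """Median number of minutes between a commit and its production deploy."""
    minutes = [
        (deployed - committed).total_seconds() / 60
        for committed, deployed in changes
    ]
    return median(minutes)


if __name__ == "__main__":
    print(median_lead_time_minutes(changes))
```

The median is used here instead of the mean so that one change that sat around overnight does not dominate the number.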

If you can get stuff to work with a regular old-fashioned LAMP stack: please do. Monitoring is easy. Many companies have much more elaborate and complex systems. Observability is really a problem. You basically have to deal with unknown unknowns. If your pager goes you often say “hey, that’s new!”…

Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments practically worthless. Lots of things will go wrong in production that are irreproducible in staging. Or they’ll go wrong in one part of your production system and never in the other part…

As a software developer, operational literacy is not nice-to-have, it is essential.

Well, you need observability. Monitoring is not enough. See the monitoring is dead talk (note by Reinout: I hope that is the correct link). Monitoring hasn’t really changed in the last 20 years. It basically can only handle what you can predict beforehand.

Observability is a measure of how well internal states of a system can be observed by looking at it from the outside. So: can you answer new questions by looking at your system’s available outputs? Important point: can you answer those questions without deploying new code?
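A small sketch of what "answering new questions without deploying new code" could look like, assuming your system already emits one structured event per request. The events, the field names and the helper `slowest_by` are all hypothetical; the point is only that a new question becomes a new query over data you already have.

```python
from collections import defaultdict

# Hypothetical events, one dict per handled request.
events = [
    {"endpoint": "/export", "customer_id": "c-17", "build": "a1f3", "duration_ms": 2300},
    {"endpoint": "/export", "customer_id": "c-17", "build": "a1f3", "duration_ms": 2100},
    {"endpoint": "/export", "customer_id": "c-42", "build": "a1f3", "duration_ms": 120},
    {"endpoint": "/home", "customer_id": "c-42", "build": "9bc0", "duration_ms": 80},
    {"endpoint": "/home", "customer_id": "c-17", "build": "9bc0", "duration_ms": 90},
]


def slowest_by(events, field):
    """Return (value, mean duration in ms) pairs for `field`, slowest first."""
    durations = defaultdict(list)
    for event in events:
        durations[event[field]].append(event["duration_ms"])
    means = {value: sum(ds) / len(ds) for value, ds in durations.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Yesterday's question: which endpoint is slow?
    print(slowest_by(events, "endpoint"))
    # Today's new question, same data, no new deploy: which customer is slow?
    print(slowest_by(events, "customer_id"))
```

With predefined metrics you would have had to decide beforehand to track latency per customer. Here the field is simply already on the event.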

Complexity is exploding everywhere. Monitoring is intended for a predictable world.

Testing in production: do it. Everyone does it. The bad ones just don’t admit it. She mentioned something about senior engineers: you trust their instincts. If a senior dev says “I have a bad feeling about this”, you stop and investigate. So you want the senior devs (and all others) to hone their experience on the actual production systems. If you “train” them on the staging systems, you’re training them on the wrong data.

Three principles of software ownership:

They who write the code …
… can and should deploy their code …
… and watch it run in production.

You need to develop a good feel for it. Don’t only look at it when it breaks, but observe it regularly. See if it behaves in the way you expect it to. You need experience! “Muscle memory”.

What you want to accomplish is that problems are corrected and bugs are fixed before the customers find out about it. You want to build a well-behaved system. It is not just OPS that should be on call in the middle of the night. DEV should be, too. Only then will you write well-behaved code.

As a dev, you should spend more time observing your real system. Spend less of it in your artificial environment.

Software should be/have:

well-instrumented
high cardinality
high dimensionality
event-driven
structured
well-owned
sampled
tested in production.
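
A few of the items above (structured, high cardinality, high dimensionality, sampled) can be sketched together in a couple of lines of Python. This is my own illustration under assumptions, not code from the talk: the function `emit_event` and all field names are hypothetical. It writes one wide JSON event per unit of work and keeps only 1 in `sample_rate` of them. Storing the sample rate on the event itself is one common way to make it possible to re-weight counts afterwards.

```python
import json
import random
import time
import uuid


def emit_event(fields, sample_rate=1, rng=random, out=print):
    """Write one structured event, keeping roughly 1 in `sample_rate`.

    Returns the JSON line that was written, or None if the event was
    dropped by sampling.
    """
    if sample_rate > 1 and rng.randrange(sample_rate) != 0:
        return None  # dropped by sampling
    # The sample rate travels with the event so counts can be scaled back up.
    event = {"timestamp": time.time(), "sample_rate": sample_rate, **fields}
    line = json.dumps(event, sort_keys=True)
    out(line)
    return line


if __name__ == "__main__":
    emit_event(
        {
            # High-cardinality fields: (nearly) unique per request or user.
            "request_id": str(uuid.uuid4()),
            "user_id": "u-90210",
            # Many dimensions on one event instead of many separate log lines.
            "endpoint": "/checkout",
            "build": "a1f3",
            "duration_ms": 182,
            "status": 200,
        }
    )
```

An event that survives with `sample_rate` 10 stands in for about ten real requests, so you multiply by that number when counting.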

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
