In radiology, it takes people a long time to become experienced. Medical school, an MD, certified radiologist… And when they’re 68, they’re off to a pension. What they did at Quantib was to try and “scale radiology experience with AI”.
Detection and classification of prostate lesions. Same with breast MRIs. Brain shrinkage. They hope it increases the number of MRI scans that can be processed. And also the quality of the analysis.
He demoed the application. There’s detection of brain regions in the software, for instance. When you compare two MRI scans at different points in time, you can see the difference and compare that difference with what you would see in a healthy person.
Hospital practice often means downloading radiology MRI images from a central hospital image storage server (“PACS”), taking them to a separate workstation for analysis and then going back with reports. This takes time, so it is sometimes omitted due to time pressure…
What they’re working on now is to run their AI software on a server and connect it to the image storage service. They designed their software as a bunch of microservices. Storage service, import, dispatch, workflow service, processing.
Nice idea: you can add exporter plugins to the system by means of docker containers.
Better scalable. AI on GPU nodes can be expensive. So it is more cost effective to only have to scale those AI services there and use regular nodes for the rest.
It is easier to reason about a separate service in isolation. Failure modes and security is easier to figure out. And, important for a hospital, regulatory requirements are better manageable: risk management, cybersecurity.
Of course, testing in isolation is easier.
Microservices are a bit harder to set up than a monolith. Especially when a large part of the team isn’t really experienced with devops-type work.
The core services and the front end are done with python and django. The services also mostly use django restframework. All the communication between the services is done with REST APIs. Extensions also talk to the APIs. Django restframework is mostly straightforward to use.
When designing an API, make it a nice clean clear consistent REST API. Follow REST good practices. Plural nouns (workflow/workflows). Use HTTP verbs (get/put/post/delete). If resources are nested, also nest them in the URLs. A puzzle: using the right HTTP status codes. There are nice decision trees available for that online. Don’t compromise!
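The conventions above can be sketched in a few lines of plain Python. Note that the resource names (workflows, steps) and the helper are made up for illustration; a real Django REST framework project would express this with routers and viewsets instead.

```python
# Hypothetical helpers illustrating the REST conventions from the talk;
# not part of any framework.

def resource_url(*segments):
    """Build a nested URL from (plural-noun, id) pairs, e.g. /workflows/7/steps/."""
    parts = []
    for name, pk in segments:
        parts.append(name)            # plural collection name
        if pk is not None:
            parts.append(str(pk))     # detail view for one resource
    return "/" + "/".join(parts) + "/"

# The HTTP verb does the work; the URL only names the resource.
# And don't compromise on status codes:
STATUS = {
    ("POST", "created"): 201,   # new resource created
    ("GET", "ok"): 200,         # resource returned
    ("DELETE", "ok"): 204,      # deleted, nothing to return
    ("GET", "missing"): 404,    # no such resource
}
```

So a step nested under a workflow lives at `resource_url(("workflows", 7), ("steps", None))`, i.e. `/workflows/7/steps/`.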
Deployment: docker in swarm mode (they’ll move to kubernetes later). Docker secrets. Gunicorn+nginx. TLS everywhere: mutual TLS between the services, regular one-way TLS between the browser and the front-end service.
Bas likes using programming in his life. For instance home automation: https://github.com/basnijholt/home-assistant-config
He didn’t care about home automation until he found a way to do it in python (home assistant) and he had a good use case. The use case was the elaborate video/audio system of a family member they were moving in with. It should not take six different buttons to finally get the TV running. Time to automate it.
Home automation is an expensive and time consuming hobby (“if it doesn’t cost time and if it doesn’t cost money, it is no hobby”). Changing lights. Turning heating on or off. When you go to the bathroom at night after you’ve been sleeping, don’t turn on the bright light in the toilet, but use a soft red light. Controlling the robot vacuum cleaner to only do its work when everyone is out of the house. Using a smart current meter connected to the washing machine that sends a message to your phone when it is ready. A packet sniffer between the regular thermostat and the heater to intercept and control it. A humidity sensor in the bathroom to detect when you’re showering: then the lights should stay on despite there being almost no movement :-)
Home automation should be fun and (mostly) useful. It should not invade your privacy or complicate your life.
Regarding complication, two things to keep in mind from the python philosophy:
If the implementation is hard to explain, it is a bad idea.
If the implementation is easy to explain, it may be a good idea.
So: home assistant. The big problem that it solves is that it ties everything together: all the various protocols (wifi, bluetooth, infrared, etc), all the various devices (temperature, humidity, switches, cameras, sockets, etc) and all the various companies… It is written in python. You have abstract “Devices” classes that can be subclassed. And there are lots of examples.
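The subclassing pattern mentioned above can be sketched like this. Home assistant’s real base classes (e.g. `homeassistant.helpers.entity.Entity`) are much richer; this is a simplified, self-contained stand-in just to show the idea, with a made-up humidity sensor from the shower example.

```python
# Simplified stand-in for home assistant's abstract device classes;
# class and function names here are invented for illustration.
from abc import ABC, abstractmethod

class Sensor(ABC):
    """Abstract device: integrations subclass this and fill in the reading."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def state(self):
        """Return the current value of the sensor."""

class HumiditySensor(Sensor):
    """Toy bathroom humidity sensor; a real one would talk to hardware."""

    def __init__(self, name, reading=55.0):
        super().__init__(name)
        self._reading = reading

    def state(self):
        return self._reading

def shower_detected(sensor: Sensor, threshold=80.0) -> bool:
    # The "keep the lights on while showering" rule from the talk:
    # high humidity means someone is in the shower, movement or not.
    return sensor.state() > threshold
```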
It is open source. Really open source, as it is in the top 10 github projects when you look at the number of contributors. There are lots of active developers. There are even four full time developers paid for by home assistant users!
He then showed his dashboard… A list of plants with their humidity level, for instance. Energy usage. Which lights were on or off. He sent his robot vacuum to a certain room through the web interface. He also showed a video he recorded: nice!
To start with, a raspberry pi and some sensors is enough. You probably already have a few devices at home that you can connect.
Sander works at downdetector: a service that detects when something is down. They monitor loads of services (facebook, etc). Often they notice an outage earlier than the actual service itself.
They make most of their money from enterprise subscriptions that use it to monitor their own services and also the services they in turn depend on.
They’re using python and django and started in 2012. They initially used python-nltk to scrape twitter messages to determine if there was an outage for a certain service.
They started on physical servers (which he hates, as they tend to die sometimes), then moved to AWS and they’re now using serverless a lot. For serverless they switched parts from django to flask. Django is now used for database migrations and the admin, mostly.
Basically: async everything. A server creates jobs in redis, workers get jobs. A separate service monitors the queue size and increases and decreases the number of workers.
They use python RQ, “easy job queues for python”, which works with redis. He is really enthusiastic about it. It is really simple to use.
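RQ itself needs a running redis server, so as a self-contained stand-in, here is the same producer/worker shape built with the stdlib: a server puts jobs on a queue, workers pull them off. With RQ the calls would be `Queue.enqueue()` plus `rq worker` processes, and the job payload here (`item * 2`) is of course a placeholder.

```python
# Stdlib sketch of the async job-queue pattern from the talk.
import queue
import threading

jobs: "queue.Queue" = queue.Queue()
results = []

def worker():
    while True:
        item = jobs.get()
        if item is None:          # sentinel: no more work for this worker
            break
        results.append(item * 2)  # the actual "job" (placeholder work)
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for n in range(5):                # the "server" enqueueing jobs
    jobs.put(n)
for _ in threads:                 # one sentinel per worker
    jobs.put(None)
for t in threads:
    t.join()
```

A separate monitor (as in their setup) would watch `jobs.qsize()` and scale the number of workers up or down.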
He then explained their setup, which uses loads of amazon services. A question from the audience was “don’t you have extreme lock-in this way?” His answer was: “if you use the cloud, go all-in”. If you can only use a small subset because you might want to move to a different cloud provider, you’re missing out on a lots of stuff. You ought to just use a regular virtual server, then. Much cheaper. If you have the money to use the cloud, go all in. Use all the nice tools and all the managed services.
What they also like: python’s @lru_cache decorator. Also: “black” for code formatting. Flask. Pipenv. https://codecov.io. statsd. Grafanacloud.
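The @lru_cache decorator they mention is in python’s stdlib `functools`; it memoizes a pure function for free. A quick illustration with the classic fibonacci example:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Naive recursion, made linear-time by the cache."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))           # instant, despite the naive recursion
print(fib.cache_info())  # hits/misses bookkeeping for free
```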
He quickly showed some personal projects at https://github.com/ambardas .
Based on the book “deep work”, he wrote https://github.com/ambardas/make_my_day_planner to re-schedule his google calendar a bit.
In between he showed how to use pytest, doctests and coverage. And github actions to automatically run it on github. Note: quite a lot of audience members mentioned that they like github actions, especially the speed.
Fun: https://github.com/ambardas/sorting_performance (currently, look in the development branch). A small project to determine the optimal on-a-table sorting process for supermarket football cards. You can optimize for speed or for you-can-do-it-while-doing-other-things.
See https://visualgo.net/bn/sorting for nice visualisations.
I have been getting periodic deprecation notice emails from github for the last few months:
You recently used a password to access an endpoint through the GitHub API using okhttp/2.7.5. We will deprecate basic authentication using password to this endpoint soon:
We recommend using a personal access token (PAT) with the appropriate scope to access this endpoint instead. Visit https://github.com/settings/tokens for more information.
Thanks, The GitHub Team
That okhttp/2.7.5 user agent? That is our old jenkins instance talking to github somehow. Apparently through basic auth. Only… where? Most of the github traffic seemed to use just an access token. Jenkins calls that the “secret text” credential type; basic auth is the “username with password” type.
What it turned out to be was the github branch source plugin. This periodically looks at our github organisation to see if there are new projects or new branches that it missed. Normally github tells our jenkins when there’s a new project or pull request or so.
Ok, on to the jenkins settings for my organisation. The confusing thing here is that the “credentials” setting says this:
Note that only "username with password" credentials are supported. Existing credentials of other kinds will be filtered out. This is because jenkins exercises GitHub API, and this last one does not support other ways of authentication.
Huh? Github is refusing user/password basic auth, which is the only thing this plugin supports? I updated every plugin, but the problem persisted.
Isn’t that message saying that you can continue to use basic auth, as long as you use a personal access token instead of your actual password? Generate a personal access token from the GitHub “Settings” page and store it in the Jenkins username/password credential as the password, with your username as the username. Check that it works: it has been working that way for me.
Ah! So “github is refusing user/password basic auth” really means “github is refusing user/password basic auth”. Using an access token instead of your password is actually fine.
The info in jenkins on those credentials actually mentions that somewhat:
If your organization contains private repositories, then you need to specify a credential from an user who have access to those repositories. This is done by creating a "username with password" credential where the password is GitHub personal access tokens. The necessary scope is "repo".
So I visited https://github.com/settings/tokens and generated a new token with full “repo” rights (this is actually quite restricted in scope, despite the name).
In Jenkins I added a new global username/password credential with the github username + the access token and hurray, everything worked again.
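Under the hood, “username plus token as password” is just a normal basic-auth Authorization header with the PAT in the password slot. A stdlib sketch (the username “alice” and the token value in the test are of course made up):

```python
# What basic auth with a personal access token boils down to on the wire.
import base64

def basic_auth_header(username: str, token: str) -> str:
    """Return the Authorization header value for user/token basic auth."""
    raw = f"{username}:{token}".encode("ascii")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

With the `requests` library you would simply pass `auth=(username, token)` and it builds this header for you.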
I attended the kubernetes meetup in Amsterdam on 2019-10-02. Here are my summaries of the talks :-)
Alex is both a founder of StorageOS and a co-chair of the CNCF storage SIG. So he’s got two hats. More details on the SIG: https://github.com/cncf/sig-storage
Why is storage important? Well, there’s no such thing as a stateless architecture, in the end. So something needs to be stored somewhere. Containers are nicely portable, but if the storage it needs isn’t portable…. That’s why it is important.
The SIG wrote a summary of the storage landscape: https://github.com/cncf/sig-storage . Normally, you had to work with whatever storage your company’s IT department was using. Now developers get a say in it.
Storage has attributes: availability, performance, scalability, consistency, durability. But they can mean different things to different people. Performance might mean “throughput” but also “latency”, for instance.
You can categorize storage solutions: hardware, software, cloud services. “Software” then means “software-defined storage on commodity hardware” and often tries to “scale out”. “Hardware” is much more specialized and tries to “scale up”.
Another categorization: access via volumes (blocks, filesystem) and access via an API (like object stores). Kubernetes mostly deals with the volumes kind.
Data access: file system, block, object store. All of them are better/worse suited for different tasks. You won’t use an object store for low-latency work, for instance.
A big differentiator: storage topology. Centralised, distributed, sharded, hyperconverged. “Centralised” often means proprietary hardware. “Distributed” often uses a shared-nothing architecture with regular hardware. “Sharded” is often good at spreading your load, but it can be very tricky to get right. “Hyperconverged” means that nodes are used for both storage and computing.
Another aspect: data protection. RAID and mirrors for local disks. Or replicas of entire nodes. Erasure coding: quite extreme distribution, that’s why amazon’s s3 can claim six 9’s of durability.
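A toy sketch of the idea behind erasure coding: real systems use schemes like Reed-Solomon rather than the simple XOR parity below, but XOR parity already shows how a lost shard can be rebuilt from the survivors.

```python
# Toy single-parity "erasure coding": any one lost shard is recoverable.
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(shards):
    """Parity shard = XOR of all data shards."""
    return reduce(xor_bytes, shards)

def rebuild(remaining, parity):
    """Recover the single missing shard from the survivors plus parity."""
    return reduce(xor_bytes, remaining, parity)

shards = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(shards)
lost = shards.pop(1)                     # a disk dies
assert rebuild(shards, parity) == lost   # ...and its data comes back
```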
Kubernetes has the CRI (the container runtime interface) and the CNI (networking). It now also has a CSI: storage. Kubernetes is a container orchestration solution, so it really also needs to talk to the storage layer.
How k8s progressed:
K8S native drivers: hard to debug and update.
Docker volume driver interface.
K8S flex volumes, the first outside-of-the-core solution. It still works.
CSI, container storage interface. 1.0 was released in 2018, it is now the standard.
Now the second part of the presentation: StorageOS. “Software defined cloud native storage”. It is a containerised project, so there are no other dependencies.
It consists of two parts. Control plane: it manages the actual storage. The data plane manages the volumes (both block and file system).
It normally is deployed as a single light-weight container on every individual node (via a daemonset, for instance). Every container has an API. One of the integrations available for it is k8s’ CSI.
StorageOS creates a pool of storage that spans the entire cluster. An admin will configure/register storage classes. Developers put “volume claims” in their k8s.
As soon as you get a volume in the storage pool, it is available on any node in the entire cluster. This gives you lots of flexibility in moving containers between nodes.
Behind the scenes, it uses synchronous replication between a primary volume and a user-defined number of replicas to protect data from disk or node failure. Nodes can have different numbers/sizes of disks.
They’ve tried to make storageOS usable for a “hyperconverged” environment where every node is used for both storage and calculation. StorageOS will run quite happily on a single CPU and a GB of RAM.
Most people will manage storageOS via k8s, but you can also use the command line or a GUI. For monitoring, they provide lots of prometheus endpoints.
Some extra features:
Locality: you can get the workload to run on the node where the data is.
There’s encryption at rest. Keys are stored as kubernetes secrets. The advantage is that you have your keys, instead of your cloud provider having the keys to your data.
Sergey works at everon/evbox (https://evbox.com), the host of the meeting.
They knew from day one that they had to run in the cloud, so they were lucky to be cloud-native from the start. They chose Google’s cloud platform then. And in general, it has been working fine for them.
They had a small team originally and didn’t want to “waste” time on infrastructure. They started using Google App Engine. Google at that time used the marketing term “NoOps”, which sounded fine to them :-)
When they switched to kubernetes, it took seven months. That was a bit long. They tried to get buy-in for the process by involving lots of people from most teams. This wasn’t such a good idea (making decisions took a lot of time); it would have been better to do it with a smaller ad-hoc team. Another reason for the slow switch was that the company was growing a lot at that time: they needed to get the new developers up to speed at the same time.
Another problem: slow development environments. They used Docker Desktop. That used 25% CPU when idle. Kubernetes just isn’t designed to run on a laptop. (Note: there were some other suggestions, like minikube, from the audience)
A third problem: cluster configuration. Configuring anything within a kubernetes cluster works fine. But once you have to interact with something in the outside world (like some IP ranges), you can run into trouble.
Some lessons learned:
Try it with one product first. Only then move on to the rest of your products. You have some initial pain because you have to maintain two infrastructures, but it is worth it.
Spread the knowledge, but focus. Don’t let knowledge-spreading hold your migration back.
Set a scope by prioritizing. Application servers; configuration/scheduling/service mesh; messaging/storage.
Know the cost of a configuration change.
Know if cloud-agnostic is important for you.
Monitoring is important. The rest of the talk is about monitoring.
Monitoring. There’s a lot! Zabbix, prometheus, splunk, nagios, datadog, graphite, etc.
A book he suggests: “The Art of Monitoring”. From the same author there’s also “Monitoring with Prometheus”.
Monitoring: there are lots of sources. Your code, libraries, servers, the OS, your infrastructure, services from your cloud provider, external services, etc. And there are many destinations: storage, visualisation, alerting, diagnostics, automation, etc.
So: make an inventory of what you want to monitor and how you want to use it.
In kubernetes, you additionally want to monitor containers, pods, nodes and your cluster. There are some extra sources, too: kubelet, the scheduler and the proxy. Interestingly, there are also more destinations: the scheduler (they’re not that advanced that they need to customise it, yet), autoscalers (they’re using this) and the dashboard.
Note: there is no built-in monitoring data storage solution in kubernetes. You’ll need to use something else for that (like prometheus).
What you need to design is a monitoring pipeline:
Some public clouds have their own default monitoring solution. With google, you get “stackdriver”. Amazon: cloudwatch. Azure: monitor. It is relatively cheap and it is preconfigured for the tooling you’re using.
If you don’t want to use such a specific monitoring stack… and if you want an OSS stack… Very common: prometheus (https://prometheus.io/). And for visualisation, grafana.
Prometheus itself is just a monitoring gatherer/forwarder, but there are several other projects under its umbrella, like TSDB for storing the monitoring data. There’s also an alert manager. There’s no visualisation, but you can use grafana for that. Prometheus uses a pull model, so you need to provide metrics via endpoints for it to collect. If you need to push metrics, you can configure a “pushgateway” to work around this.
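Since prometheus pulls, each service just has to serve a plain-text page of metrics. In practice you would use the official `prometheus_client` package; this stdlib sketch (with made-up metric names) only shows roughly what such a scrape endpoint serves:

```python
# Minimal sketch of prometheus' plain-text exposition format.
def render_metrics(metrics: dict) -> str:
    """Render {name: (type, value)} as a scrape-able metrics page."""
    lines = []
    for name, (mtype, value) in sorted(metrics.items()):
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

page = render_metrics({
    "http_requests_total": ("counter", 1027),
    "queue_size": ("gauge", 3),
})
print(page)
```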
For OSS, you can also look at InfluxData (InfluxDB, telegraf, chronograf, kapacitor).
Open source stacks: they’re cheap. Cloud-agnostic. Highly customizable. A healthy ecosystem. There is still a bit of competition in this area: graphite, ELK, zabbix/nagios.
And… there are loads of commercial solutions that promise to solve all your monitoring problems. For instance Datadog. Datadog inside kubernetes means installing an agent container on every node. Once the agents have collected the metrics, datadog handles everything else for you.
Commercial solutions: they cost you a lot of money. But they’re often quick to configure! So if you have the money to spend, you can get up and running with pretty good monitoring real quick. There’s lots of competition in this area. Lots of companies offering this kind of service.
There was a question about logging. He answered that google’s stackdriver is working quite OK here. If they move to OSS, they’ll probably use prometheus for monitoring and an ELK stack for logging. Doing the monitoring inside ELK, too, wouldn’t give you good monitoring, he thinks.
Kubernetes 1.16: watch out, some APIs have been deprecated. When deploying a new cluster for a workshop two days after 1.16 came out, with infrastructure as code, his setup broke down, because Helm and all the Helm charts he used were broken… He flies close to the sun by always directly using the latest of the latest, but be aware that the change to 1.16 can be somewhat more bothersome.
Something to look at: Octant, made by vmware. It is a bit like kubernetes dashboard, but works on the client (uses kubectl config file). It visualizes ‘kubectl’. https://github.com/vmware-tanzu/octant
Kapp (https://get-kapp.io/). It is part of https://k14s.io/, “kubernetes tools that follow the unix philosophy to be simple and composable”. Kapp is a bit comparable to ansible, especially in its output. It is a simple deployment tool, focused on the concept of a “kubernetes application”.
Tibo works as lead developer/devops guy on the https://nu.nl website (very well known in the Netherlands). 12 million hits/day.
Their IT team is growing. It gets impossible to know everything about everything.
Infrastructure provisioning is done with “terrible” (terraform + ansible…) :-) Lots of AWS. Their existing devops practices were on a solid foundation:
All infra is in code.
“Terrible” provides mechanisms for authentication and so on.
But… setting up extra test environments is slow. Terraform has a slow feedback loop (the difference between plan and apply). Ansible could take 20 minutes. The infra isn’t very scalable (due to needing a reasonably fixed ansible inventory). Config and secrets management becomes problematic.
So they wanted to improve something. But where to start? Lots of items are connected, so it is hard to find a starting point. A trigger point occurred early 2018. Kubernetes just got ready for production + they had to start a brand new website…
An advantage of kubernetes is that it is a flexible platform. A platform to run containers on. But also a platform as a means to work on better logging, better separation, better 12factor usage, etc …. Kubernetes is a journey, not a destination.
But they didn’t want to get carried away. Not everything needed to be chopped up into miniscule nanoservices. And not everything needed to be in kubernetes: using AWS’ managed databases and so on was fine.
(He then mentioned the various components they used; I’m not that versed in kubernetes yet, so I couldn’t make a good summary of that part of his talk). For CI/CD they use Jenkins.
Some things that went wrong:
Memory usage. Make sure you set memory limits.
CPU usage. Yes: also set CPU limits. During a certain event, the servers started to use so much CPU that the core kubernetes components started to suffer… Reserving memory and CPU for kubelet also helps.
Having memory limits can also be a problem. They upgraded a component which started to legitimately use a bit more memory. It hit the memory limit, got killed, started again, got killed, etc…
Apart from these problems (which you can have with other solutions, too), kubernetes runs pretty stable.
They’re looking at improvements: Helm charts, combined with SOPS (Secrets OPerationS) by mozilla (it manages AWS API access instead of keys; it is versatile). They’re quite happy with Helm. A big advantage is that your git diff is much more informative than without helm.
Full title: “shattering worlds in a good way - from Docker to Kubernetes within an international leads company”. Ruben (https://www.n-ableconsultancy.nl) worked with a small international company to move from docker to kubernetes within four months. He provided consultancy and training to help them “navigate” the very diverse kubernetes landscape.
Fun fact: Ruben previously had a company where he built a kind of “kubernetes-light” before kubernetes even existed. Since then he’s switched to kubernetes.
Kubernetes is the next step in virtualization. Kubernetes effectively virtualizes all the OSI layers. “It cuts tethers”. Kubernetes aims at applications. It gets your application running, wherever you want to run it.
Kubernetes enables change. It focuses on computing and state at the level of your application. What he means is that even the infrastructure and the provisioning of your app becomes part of the application. Your “span of control” becomes broader.
Roles will grow together that way. It will have a big influence on the IT landscape.
Back to his use case. He got a call because the company’s CTO was about to leave and almost no-one knew how everything fitted together. The first priority: find out, exactly, the current situation. What is there? Why is it there? You need a good inventory. You need to know your starting point. This phase took a month (one day a month…). The design was brilliant, creative and crazy at the same time.
He experimented with kubernetes the weekend before he started working for the company. He suggested it and they OK’ed it. As a proof of concept they installed kubernetes on bare-metal at https://hetzner.de .
On day one of the proof-of-concept Ruben started them on experimenting right away. A deep dive. One of the developers started to turn red: “I don’t know anything about networking, I can’t work on this” etc. He felt insecure and got angry about it. A nice start :-)
But… in the end they got the front-end running in kubernetes, even though the original setup was quite elaborate. At the end of the three weeks, the developer started to get enthusiastic.
Step two was to get it highly available. They switched from self-hosted to google (GCE, GKE). Google seemed to be the best for kubernetes hosting.
For every component, they tried multiple solutions. For central monitoring, they went with datadog, for instance.
Step three was becoming complete. Security, authorization, CI/CD (gitlab). And migrating their big data environment to GCE.
Step four: functional and load testing. The company wanted to go live immediately, but he managed to convince them to first make sure everything worked. They redirected some of the existing application’s load over to the new solution, which worked OK.
They did load testing and it turned out that kubernetes could handle 10 times as many requests per second. Testing like that gives you confidence that everything will work just fine.
So: time to go live! They made a mistake and accidentally deleted the entire cluster an hour before going live…. So they stayed on the existing system. One of the developers went home, re-created the cluster, started up the three-hour-mongoDB-sync, went to bed and the next day they actually went live.
Important point: being able, as a developer, to re-create the entire environment was a big booster for his confidence in kubernetes.
Another important point: they now understand their entire environment. And it is much more secure. Before, they had a black box; now they can see what’s going on.
Kubernetes forces you to be explicit. It helps you build a better and safer environment.
Kubernetes is a change enabler and a game breaker. It gives control and power back to companies that need their IT to be successful. It can make or break a business’s competitiveness.
Ruben thinks kubernetes will impact the way we all work with IT and the way we consume IT in the same way Ford’s assembly line changed the auto industry.
Kubernetes has quite some terms you need to get to know. If you want to start explaining it to colleagues, you need to watch out not to inundate them with terms.
kubectl is a command line interface for running commands against kubernetes clusters: create, delete, describe, get, edit, logs, etc. Kubectl allows you to create simple custom scripts, which is handy.
minikube emulates a kubernetes environment inside a VM (virtualbox, vmware, etc). It is handy to get you going.
helm is basically the package manager for kubernetes. A “helm chart” is a set of instructions to install something. Important: the NOTES.txt file; its contents are shown to the user after installing the package.
skaffold is a command line tool for continuous development.
telepresence is a tool to create your development environment inside a production cluster. This really speeds up development.
When you want colleagues to create helm charts, it helps to script it a bit so that the boilerplate is already generated for them.
We have our 24 hour hamster wheel of work. Homo economicus. Getting things done.
There’s a cloud gazers society: they just look at clouds. No, not the IT clouds, but the real clouds outside in the sky :-)
Look at clouds. Be lazy. Lazy time is time saved up for later :-)
A fast and successful workflow with failures and nothing to be ashamed of.
We start with accountability. Often humans get the blame, for instance in accidents with ships or planes. But why would machines be so perfect even though they’re built by humans?!?
Perfection is the killer of any good.
If you design a system: design it so that it tolerates failures. Then it will be robust.
Allow yourself and others to fail. Be humble.
If you’re working in a toxic environment, you’ll have to narrow your objectives.
You will get tips like “just do what you have to do”. Don’t do anything extra. And don’t get creative.
Get a mentor, but what you really need is a champion.
Relax: just work as little as possible. That is part of your compensation.
Also grab all the extra’s (like gym memberships).
Powerpoint: a corporate presentation is a regular document that is accidentally printed in landscape.
He has some more here: https://cote.io/books (free in the week of the conference)
Kubernetes has “pre-stop hooks”.
He wrote https://github.com/noamt/stop to make it easy to work with the hook: it can send a signal to any go application.
You cannot lean, agile or devops your way around a bad organisation culture.
Measuring and monitoring everything. Most don’t know what they really want or need to measure.
Multiple work management tools (email + all the others).
Misalignment of incentives.
Institutional versus tribal knowledge. Knowledge you need, but don’t really have.
Incongruent organisational design. The company isn’t structured in the best way possible.
Managing complexity. Complex systems are often not understood.
Security and compliance, devsecops. Security theater.
Developers: were being told by management to build new features. Operators: were being told by management that everything had to be stable.
Developers are often better connected to the business. And what they’re doing obviously has a measurable effect. On the other hand, ops only got noticed when something broke (even though it might be the developers’ fault).
There was a thick wall between dev and ops.
Devops, in its purest definition, is breaking down the barrier between dev and ops. Slowly it started to extend, for instance to include security. DevSecOps and so.
There are five key areas to devops:
Reduce organisational silos. This is often doable. You could experiment with putting dev and ops in the same room: automatically, they’ll start cooperating.
Accept failure as normal. If you are not allowed to make mistakes, you won’t show much initiative. If your job is on the line whenever there is a change, you’ll most likely say “no” if someone requests a change. So: allow failures, that’s much better for your organisation.
Implement gradual change. Release often. Allow for easy rollbacks.
Leverage tooling and automation. Sometimes you hear people say “I installed ansible, so I’m using devops!”. No, that’s not it. The automation is a necessary tool for implementing the other four points.
All five are abstract ideas. They don’t tell you how to do it. Devops defines the end result, but doesn’t define how to get there.
SRE (site reliability engineering) is an implementation of those five devops ideas:
(I missed this one)
SRE embraces “blameless post-mortems”. And it allows for failures: they’re budgeted.
Small changes: yes. Ideally a database change is in a separate change from the code changes.
Automate everything away that we’re doing manually now.
SRE is obsessed with measuring. Especially because of the SL-somethings.
SLI: service level indicator. Often observed by an external system.
SLO: service level objectives. Binding target for SLI.
SLA: service level agreement. Business agreement between a customer and a provider.
Watch out with the target that you aim for. If your site has 99.9999999% availability but your users access it through their cellphones… the availability will look much lower. So: don’t go overboard. High availability costs serious money: what is your availability budget?
If the availability budget is drained, you cannot deploy new features anymore until the budget is increased again.
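The availability budget idea can be sketched like this (numbers and names below are made up): with a 99.9% SLO you may “spend” 0.1% of requests on failures, and once that budget is drained, feature deploys stop.

```python
# Illustrative error-budget arithmetic, not production code.

def error_budget_left(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative = overspent)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures

# 1M requests, 99.9% SLO -> ~1000 failures allowed; 600 already happened.
budget = error_budget_left(slo=0.999, total_requests=1_000_000,
                           failed_requests=600)
may_deploy = budget > 0
print(f"budget left: {budget:.0%}, deploys allowed: {may_deploy}")
```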
SRE looks down upon toil: stuff that is manual, repetitive, automatable, tactical and devoid of long-term value. There’s a budget for reducing toil (and thus increasing automation).
He showed a picture of a relaxed zebra. Imagine a lion suddenly starts chasing the zebra: it activates the fight-or-flight response. Blood pressure rises, heart rate goes up, the digestive system shuts down, etc.
If the zebra gets caught, the freeze response is triggered. This might confuse the lion, which sometimes moves on to another zebra. If the zebra survives, he starts shaking and returns to normal. He shakes it off, literally.
We humans are different. We have a pre-frontal cortex. It has lots of advantages. There’s also a disadvantage: it re-plays horrors from the past. And our nervous system reacts the same to the re-play as to the original.
Zebras and other animals shake off life-threatening occurrences on a regular basis. We humans are not so good at that. Our nervous system can start oscillating. Or it can become stuck on “on” or “off”…
Trauma occurs when one’s solution (=active response to threat) does not work.
Trauma can result from both real and perceived threats.
Trauma is subjective and relative.
Organisations can also have traumas. An unexpected outage, for instance. Organisations often react in the same way as humans…
Organisations can be in a hyperarousal state. Fight or flight. Military terms get used. Lots of energy is used, which is then unavailable for other pursuits.
Hypo-arousal (“stuck off”). Freeze. We just won’t make any changes.
Watch out for inappropriate responses. The responses we had 10 years ago might not be valid in today’s much more complex cloud environment.
Some homework: see if you can determine your organisation’s “window of tolerance”. Which ups and downs can we handle just fine? Resilient organisations are not traumatized by routine threats to their business.
There are ways to cure humans of trauma. Some of them can be applied to organisations. Like practicing during a game day. Then when a real incident happens, you’ll remember the training exercise and it won’t be as bad.
If there is a real incident that turns out not to be so bad: just continue your incident response at the original severity level. That way you get some more exercise and it becomes normal.
Watch out for cognitive distortions. He mentions a few:
Polarized thinking. All or nothing.
Fortune telling. We feel that if we have enough data, we can predict the future…
Control fallacies. Either “we have no control” or “we have absolute control”.
Resilient strength is the opposite of helplessness.
He showed a number of the power plugs and adapters that he has to take with him when traveling. Quite a lot. The original power plugs were designed by the actual power companies.
The cloud industry today is like the power industry in the beginning: basically unregulated and very powerful.
A starting industry needs a “killer app”. For the electrical power industry, it was the light bulb. There was quite a lot of fighting over plugs and connectors, as whoever controls those has a lot of influence, including lock-in effects. Same with the cloud: every provider has its own tooling and “standards”.
Containers are multi-cloud. They’re a bit like the multi-cloud plug, like a power plug adapter. You avoid lock-in.
Cloud is changing the world. Businesses, software vendors, communities are all involved. He thinks the communities are the most important in the end. We, the community, have to solve problems collaboratively. We are the ones that have to figure out the new standards.
A difference between the cloud now and the power industry then: around 1900 you had Taylorism. Industrialization, with just a few decision makers. You’d get standardized tests so that you didn’t need a schooled chemist anymore: a regular worker could “just” do a simple test.
The difference he sees in the cloud now is the community. We’re the ones making the technology choices (kubernetes, for instance). We cooperate and we talk and we see each other at conferences like this.
We can design the “multi-cloud”. With containers and kubernetes, we can be cloud-agnostic. As a community we can collaborate on true portability of all workloads. We’ll have to keep asking the vendors and the clouds about this. We should steer this multi-cloud effort.
Multi-cloud is in everyone’s future. Let’s disrupt together!
She had some comments at the start. “The only good diff is a red diff”. “Junior engineers ship features, senior engineers kill features.” :-)
There are three main elements of observability: metrics/monitoring, logs, tracing. Logs are strings, mostly, so mostly useless. Monitoring is lots of data, but it is mostly out of context.
“What got you here won’t get you there”. She mostly means developing and deploying code. Releasing and deploying used to be “flipping a switch”. Now it starts to be more continual. Many more releases (and some rollbacks). And “getting it into production” should be embedded throughout the process. It should move earlier right into the process of writing the code.
What are the markers of a healthy engineering team? A good measurement is “how quickly does your code end up in production?”
If you can get stuff to work with a regular old-fashioned LAMP stack: please do. Monitoring is easy then. Many companies have much more elaborate and complex systems, though. Observability is really a problem there. You basically have to deal with unknown unknowns. If your pager goes off, you often say “hey, that’s new!”…
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments practically worthless. Lots of things will go wrong in production that are irreproducible in staging. Or they’ll go wrong in one part of your production system and never in the other part…
As a software developer, operational literacy is not nice-to-have, it is essential.
Well, you need observability. Monitoring is not enough. See the monitoring is dead talk (note by Reinout: I hope that is the correct link). Monitoring hasn’t really changed in the last 20 years. It basically can only handle what you can predict beforehand.
Observability is a measure of how well internal states of a system can be observed by looking at it from the outside. So: can you answer new questions by looking at your system’s available outputs? Important point: can you answer those questions without deploying new code?
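As a toy illustration of that definition (my own example, not from the talk): if you record wide, structured events with plenty of context, a brand-new question later is just a filter over outputs you already have, with no new code deployed.

```python
# Toy sketch of wide, structured events. Field names are invented.
import json

events = []

def record_event(**context):
    """Emit one wide event per request, with all the context we have."""
    events.append(context)
    print(json.dumps(context, sort_keys=True))

# At request time, log everything that might matter someday.
record_event(route="/api/report", status=500, duration_ms=1240,
             customer="acme", region="eu-west", app_version="2.3.1")
record_event(route="/api/report", status=200, duration_ms=90,
             customer="initech", region="us-east", app_version="2.3.1")

# Much later, a question nobody anticipated -- "which customers hit
# slow errors in eu-west?" -- is answered from existing outputs:
slow_errors = [e["customer"] for e in events
               if e["status"] >= 500 and e["duration_ms"] > 1000
               and e["region"] == "eu-west"]
print(slow_errors)  # ['acme']
```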
Complexity is exploding everywhere. Monitoring is intended for a predictable world.
Testing in production: do it. Everyone does it. The bad ones just don’t admit it. She mentioned something about senior engineers: you trust their instincts. If a senior dev says “I have a bad feeling about this”, you stop and investigate. So you want the senior devs (and all others) to hone their experience on the actual production systems. If you “train” them on the staging systems, you’re training them on wrong data.
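One common mechanism for testing in production is a percentage-based feature flag that exposes new code to a small, deterministic slice of real users first (the flag name and percentages below are invented for the example):

```python
# Sketch of a deterministic percentage rollout. Hashing the flag name
# together with the user id buckets each user into [0, 100) per flag,
# so a user always lands on the same side of the flag.
import hashlib

def flag_enabled(flag_name, user_id, rollout_percent):
    """Deterministically decide whether this user sees the new code path."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# 5% of users see the new report renderer; the answer is stable per
# user, so nobody flip-flops between old and new code paths.
for user in ["alice", "bob", "carol"]:
    print(user, flag_enabled("new-report-renderer", user, rollout_percent=5))
```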
Three principles of software ownership:
They who write the code …
… can and should deploy their code …
… and watch it run in production.
You need to develop a good feel for it. Don’t only look at it when it breaks, but observe it regularly. See if it behaves in the way you expect it to. You need experience! “Muscle memory”.
What you want to accomplish is that problems are corrected and bugs are fixed before the customers find out about it. You want to build a well-behaved system. It is not just OPS that should be on call in the middle of the night. DEV should be, too. Only then will you write well-behaved code.
As a dev, you should spend more time observing your real system. Spend less of it in your artificial environment.
Software should be/have:
tested in production.
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.