Tibo works as lead developer/devops guy on the https://nu.nl website (very well known in the Netherlands). 12 million hits/day.
Their IT team is growing. It is becoming impossible for anyone to know everything about everything.
Infrastructure provisioning is done with “terrible” (terraform + ansible…) :-) Lots of AWS. Their existing devops practices were built on a solid foundation:
All infra is in code.
“Terrible” provides mechanisms for authentication and so on.
But… setting up extra test environments is slow. Terraform has a slow feedback loop (difference between plan and apply). Ansible could take 20 minutes. The infra isn’t very scalable (due to needing a reasonably fixed ansible inventory). Config and secrets management becomes problematic.
So they wanted to improve something. But where to start? Lots of items are connected, so it is hard to find a starting point. A trigger point occurred in early 2018. Kubernetes had just become ready for production and they had to start a brand new website…
An advantage of kubernetes is that it is a flexible platform. A platform to run containers on. But also a platform as a means to work on better logging, better separation, better 12factor usage, etc …. Kubernetes is a journey, not a destination.
But they didn’t want to get carried away. Not everything needed to be chopped up into minuscule nanoservices. And not everything needed to be in kubernetes: using AWS’ managed databases and so on was fine.
(He then mentioned the various components they used; I’m not that versed in kubernetes yet, so I couldn’t make a good summary of that part of his talk). For CI/CD they use Jenkins.
Some things that went wrong:
Memory usage. Make sure you set memory limits.
CPU usage. Yes: also set CPU limits. During a certain event, the servers started to use so much CPU that the core kubernetes components started to suffer… Reserving memory and CPU for kubelet also helps.
Having memory limits can also be a problem. They upgraded a component which started to legitimately use a bit more memory. It hit the memory limit, got killed, started again, got killed, etc….
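To make the "set your limits" advice a bit more concrete, here is a minimal sketch (not something shown in the talk) that checks a pod spec for containers without CPU or memory limits. It assumes the spec has already been parsed into plain Python dicts; the function name and the example spec are made up for illustration. The field names (`containers`, `resources`, `limits`) are the ones kubernetes uses in pod manifests.

```python
def containers_missing_limits(pod_spec):
    """Return (container name, resource) pairs that lack a limit.

    `pod_spec` is the `spec` part of a pod manifest, already parsed into
    plain dicts and lists (for instance from YAML or JSON).
    """
    missing = []
    for container in pod_spec.get("containers", []):
        # `resources` or `limits` may be absent or empty; treat both as "no limits".
        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                missing.append((container["name"], resource))
    return missing


# Hypothetical pod spec: "web" sets both limits, "sidecar" only a memory limit.
example_spec = {
    "containers": [
        {
            "name": "web",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        },
        {"name": "sidecar", "resources": {"limits": {"memory": "64Mi"}}},
    ]
}

if __name__ == "__main__":
    for name, resource in containers_missing_limits(example_spec):
        print(f"container {name!r} has no {resource} limit")
```

A check like this only tells you that a limit exists. Whether the limit leaves enough headroom (see the kill/restart loop above) is something you still have to watch in practice.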
Apart from these problems (which you can have with other solutions, too), kubernetes runs pretty stable.
They’re looking at improvements. Helm charts. Combined with SOPS (Secrets OPerationS) by mozilla (it manages AWS API access instead of keys and it is versatile). They’re quite happy with Helm. A big advantage is that your git diff is much more informative than without helm.
Full title: “shattering worlds in a good way - from Docker to Kubernetes within an international leads company”. Ruben (https://www.n-ableconsultancy.nl) worked with a small international company to move from docker to kubernetes within four months. He provided consultancy and training to help them “navigate” the very diverse kubernetes landscape.
Fun fact: Ruben previously had a company where he built a kind of “kubernetes-light” before kubernetes even existed. He has since switched to kubernetes.
Kubernetes is the next step in virtualization. Kubernetes effectively virtualizes all the OSI layers. “It cuts tethers”. Kubernetes aims at applications. It gets your application running, wherever you want to run it.
Kubernetes enables change. It focuses on computing and state at the level of your application. What he means is that even the infrastructure and the provisioning of your app becomes part of the application. Your “span of control” becomes broader.
Roles will grow together that way. It will have a big influence on the IT landscape.
Back to his use case. He got a call because the company’s CTO was about to leave and almost no-one knew how everything fitted together. The first priority: find out, exactly, the current situation. What is there? Why is it there? You need a good inventory. You need to know your starting point. This phase took a month (one day a month…). The design was brilliant, creative and crazy all at once.
He experimented with kubernetes the weekend before he started working for the company. He suggested it and they OK’ed it. As a proof of concept they installed kubernetes on bare-metal at https://hetzner.de .
On day one of the proof-of-concept Ruben started them on experimenting right away. A deep dive. One of the developers started to turn red: “I don’t know anything about networking, I can’t work on this” etc. He felt insecure and got angry about it. A nice start :-)
But… in the end they got the front-end running in kubernetes, even though the original setup was quite elaborate. At the end of the three weeks, the developer started to get enthusiastic.
Step two was to make it highly available. They switched from self-hosted to google (GCE, GKE). Google seemed to be the best for kubernetes hosting.
For every component, they tried multiple solutions. For central monitoring, they went with datadog, for instance.
Step three was becoming complete. Security, authorization, CI/CD (gitlab). And migrating their big data environment to GCE.
Step four: functional and load testing. The company wanted to go live immediately, but he managed to convince them to first make sure everything worked. They redirected some of the existing application’s load over to the new solution, which worked OK.
They did load testing and it turned out that kubernetes could handle 10 times as many requests per second. Testing like that gives you confidence that everything will work just fine.
So: time to go live! They made a mistake and accidentally deleted the entire cluster an hour before going live…. So they stayed on the existing system. One of the developers went home, re-created the cluster, started up the three-hour-mongoDB-sync, went to bed and the next day they actually went live.
Important point: being able, as a developer, to re-create the entire environment was a big booster for his confidence in kubernetes.
Another important point: they now understand their entire environment. And it is much more secure. Before, they had a black box, now they can see what’s going on.
Kubernetes forces you to be explicit. It helps you build a better and safer environment.
Kubernetes is a change enabler and a game breaker. It gives control and power back to companies that need their IT to be successful. It can make or break a business’s competitiveness.
Ruben thinks kubernetes will impact the way we all work with IT and the way we consume IT in the same way Ford’s assembly line changed the auto industry.
Kubernetes has quite some terms you need to get to know. If you want to start explaining it to colleagues, you need to watch out not to inundate them with terms.
kubectl is a command line interface for running commands against kubernetes clusters: create, delete, describe, get, edit, logs, etc. Kubectl allows you to create simple custom scripts, which is handy to get your
minikube emulates a kubernetes environment inside a VM (virtualbox, vmware, etc). It is handy to get you going.
helm is basically the package manager for kubernetes. A “helm chart” is a set of instructions to install something. Important: the file. Its contents are shown to the user after installing the package.
skaffold is a command line tool for continuous development.
telepresence is a tool to create your development environment inside a production cluster. This really speeds up development.
When you want colleagues to create helm charts, it helps to script it a bit so that the boilerplate is already generated for them.
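What such a boilerplate script could look like, as a rough sketch: the speaker did not show code, so everything here (function names, the `greeting` value, the example ConfigMap template) is hypothetical. It assumes the Helm 3 chart layout (`Chart.yaml` with `apiVersion: v2`, a `values.yaml` and a `templates/` directory). Helm’s own `helm create` command generates a much fuller skeleton; a custom script is mainly useful if you want your own conventions baked in.

```python
from pathlib import Path

# Minimal Chart.yaml, assuming the Helm 3 format (apiVersion: v2).
CHART_YAML = """\
apiVersion: v2
name: {name}
description: "{description}"
version: 0.1.0
appVersion: "{app_version}"
"""

VALUES_YAML = """\
# Default values, meant to be overridden per environment.
greeting: "hello"
"""

# Helm template; the double curly braces are Helm syntax, so this string
# is deliberately NOT passed through str.format().
CONFIGMAP_TEMPLATE = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config
data:
  greeting: {{ .Values.greeting | quote }}
"""


def chart_files(name, description="A Helm chart", app_version="1.0.0"):
    """Return a mapping of relative file path -> file content for a new chart."""
    return {
        "Chart.yaml": CHART_YAML.format(
            name=name, description=description, app_version=app_version
        ),
        "values.yaml": VALUES_YAML,
        "templates/configmap.yaml": CONFIGMAP_TEMPLATE,
    }


def write_chart(target_dir, name, **kwargs):
    """Write the boilerplate chart below target_dir/name and return that path."""
    root = Path(target_dir) / name
    for relative_path, content in chart_files(name, **kwargs).items():
        path = root / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return root
```

Calling `write_chart(".", "my-service")` would create a `my-service/` directory that colleagues can start editing right away.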
We have our 24 hour hamster wheel of work. Homo economicus. Getting things done.
There’s a cloud gazers society: they just look at clouds. No, not the IT clouds, but the real clouds outside in the sky :-)
Look at clouds. Be lazy. Lazy time is time saved up for later :-)
A fast and successful workflow with failures and nothing to be ashamed of.
We start with accountability. Often humans get the blame, for instance in accidents with ships or planes. But why are machines so perfect even though they’re built by humans?!?
Perfection is the killer of any good.
If you design a system: design it so that it tolerates failures. Then it will be robust.
Allow yourself and others to fail. Be humble.
If you’re working in a toxic environment, you’ll have to narrow your objectives.
You will get tips like “just do what you have to do”. Don’t do anything extra. And don’t get creative.
Get a mentor, but what you really need is a champion.
Relax: just work as little as possible. That is part of your compensation.
Also grab all the extras (like gym memberships).
Powerpoint: a corporate presentation is a regular document that is accidentally printed in landscape.
He has some more here: https://cote.io/books (free in the week of the conference)
Kubernetes has “pre-stop hooks”.
He wrote https://github.com/noamt/stop to make it easy to work with the hook: it can send a signal to any go application.
You cannot lean, agile or devops your way around a bad organisation culture.
Measuring and monitoring everything. Most don’t know what they really want or need to measure.
Multiple work management tools (email + all the others)
Misalignment of incentives.
Institutional versus tribal knowledge. Knowledge you need, but don’t really have.
Incongruent organisational design. The company isn’t structured in the best way possible.
Managing complexity. Complex systems are often not understood.
Security and compliance, devsecops. Security theater.
Developers: were being told by management to build new features. Operators: were being told by management that everything had to be stable.
Developers are often better connected to the business. And what they’re doing has an obvious and measurable effect. On the other hand, ops only got noticed when something broke (even though it might be the developers’ fault).
There was a thick wall between dev and ops.
Devops, in its purest definition, is breaking down the barrier between dev and ops. Slowly it started to extend, for instance to include security. DevSecOps and so on.
There are five key areas to devops:
Reduce organisational silos. This is often doable. You could experiment with putting dev and ops in the same room: automatically, they’ll start cooperating.
Accept failure as normal. If you are not allowed to make mistakes, you won’t have much initiative. If your job is on the line whenever there is a change, you’ll most likely say “no” if someone requests a change. So: allow failures, that’s much better for your organisation.
Implement gradual change. Release often. Allow for easy rollbacks.
Leverage tooling and automation. Sometimes you hear people say “I installed ansible, so I’m using devops!”. No, that’s not it. The automation is a necessary tool for implementing the other four points.
All five are abstract ideas. They don’t tell you how to do it. Devops defines the end result, but doesn’t define how to get there.
SRE (site reliability engineering) is an implementation of those five devops ideas:
(I missed this one)
SRE embraces “blameless post-mortems”. And it allows for failures: they’re budgeted.
Small changes: yes. Ideally a database change is in a different change from the code changes.
Automate everything away that we’re doing manually now.
SRE is obsessed with measuring. Especially because of the SL-somethings.
SLI: service level indicator. Often observed by an external system.
SLO: service level objectives. Binding target for SLI.
SLA: service level agreement. Business agreement between a customer and a provider.
Watch out with the target that you aim for. If your site has 99.9999999% availability and your users access it through their cellphones… the availability will look much lower. So: don’t go overboard. High availability costs serious money: what is your availability budget?
If the availability budget is drained, you cannot deploy new features anymore until the budget is increased again.
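A small worked example of the arithmetic behind such a budget. This is a sketch under assumptions that were not part of the talk: availability is measured as uptime minutes over a rolling 30-day window, and the function names are made up.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of unavailability the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Budget left (in minutes) after the downtime observed so far."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def may_deploy_features(slo_percent, downtime_minutes, window_days=30):
    """New features are only allowed while there is budget left."""
    return budget_remaining(slo_percent, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"SLO {slo}%: {error_budget_minutes(slo):.1f} minutes per 30 days")
    print("deploy allowed:", may_deploy_features(99.9, downtime_minutes=50))
```

With a 99.9% objective, the budget is roughly 43 minutes per 30 days, so a single 50-minute outage drains it. Every extra nine divides the budget by ten, which is why aiming too high gets expensive quickly.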
SRE looks down upon toil: stuff that is manual, repetitive, automatable, tactical and devoid of long-term value. There’s a budget for reducing toil (and thus increasing automation).
He showed a picture of a relaxed zebra. Imagine a lion suddenly starts chasing the zebra: it activates the fight-or-flight response. Blood pressure rises, heart rate goes up, digestive system shuts down, etc.
If the zebra gets caught, the freeze response is triggered. This might confuse the lion, which sometimes moves on to another zebra. If the zebra survives, he starts shaking and returns to normal. He shakes it off, literally.
We humans are different. We have a pre-frontal cortex. It has lots of advantages. There’s also a disadvantage: it re-plays horrors from the past. And our nervous system reacts the same to the re-play as to the original.
Zebras and other animals shake off life-threatening occurrences on a regular basis. We humans are not so good at that. Our nervous system can start oscillating. Or it can become stuck on “on” or “off”…
Trauma occurs when one’s solution (=active response to threat) does not work.
Trauma can result from both real and perceived threats.
Trauma is subjective and relative.
Organisations can also have traumas. An unexpected outage, for instance. Organisations often react in the same way as humans…
Organisations can be in a hyperarousal state. Fight or flight. Military terms get used. Lots of energy is used, which is now unavailable for other pursuits.
Hypo-arousal (“stuck off”). Freeze. We just won’t make any changes.
Watch out for inappropriate responses. The responses we had 10 years ago might not be valid in today’s much more complex cloud environment.
Some homework: see if you can determine your organisation’s “window of tolerance”. Which ups and downs can we handle just fine? Resilient organisations are not traumatized by routine threats to their business.
There are ways to cure humans of trauma. Some of them can be applied to organisations. Like practicing during a game day. Then when it happens, you’ll remember the training exercise and it won’t be as bad.
If there is a real incident that turns out not to be so bad: just continue your incident response at the original severity level. This way you’re getting some more exercise. This way it becomes normal.
Watch out for cognitive distortions. He mentions a few:
Polarized thinking. All or nothing.
Fortune telling. We feel that if we have enough data, we can predict the future…
Control fallacies. Either “we have no control” or “we have absolute control”.
Resilient strength is the opposite of helplessness.
He showed a number of the power plugs and adapters that he has to take with him when traveling. Quite a lot. The original power plugs were designed by the actual power companies.
The cloud industry today is like the power industry in the beginning: basically unregulated and very powerful.
A starting industry needs a “killer app”. For the electrical power industry, it was the light bulb. There was quite a lot of fighting over plugs and connectors, as the one who controls those has a lot of influence, including lock-in effects. Same with the cloud: every provider has its own tooling and “standards”.
Containers are multi-cloud. They’re a bit like the multi-cloud plug, like a power plug adapter. You avoid lock-in.
Cloud is changing the world. Businesses, software vendors, communities are all involved. He thinks the communities are the most important in the end. We, the community, have to solve problems collaboratively. We are the ones that have to figure out the new standards.
A difference between cloud now and power industry then: around 1900 you had Taylorism. Industrialization, having just a few decision makers. You’d get standardized tests so that you didn’t need a schooled chemist anymore: a regular worker could “just” do a simple test.
The difference he sees in the cloud now is the community. We’re the ones making the technology choices (=kubernetes, for instance). We cooperate and we talk and we see each other at conferences like this.
We can design the “multi-cloud”. With containers and kubernetes, we can be cloud-agnostic. As a community we can collaborate on true portability of all workloads. We’ll have to keep asking the vendors and the clouds about this. We should steer this multi-cloud effort.
Multi-cloud is in everyone’s future. Let’s disrupt together!
The IJ is the river/lake behind the conference building (in Amsterdam). Well, what is it? A river or a lake? Now it looks more like a river, if you look at older maps, it is more of a lake. It changed!
CLAMS - culture, lean, automation, metrics, sharing. You can make a graph out of it with five axes. Map out your organisation! Look at the shape. Is it changing? Does it need to change?
Cloud architects. DevOps engineers. The cloud architects have various tools to make nice overviews and graphical images of the infrastructure they’re planning, for instance cloudcraft.co for amazon.
Devops: everything should be code. So what do we want? Textual formats! Like terraform’s configuration format.
There’s an “export to terraform” function now. It is/uses modules.tf: terraform code generated out of cloudcraft.co’s data.
Why do we want yaml?
We want something human-readable, so that rules out xml.
We want comments, so that rules out json.
There are issues, though. Weird conversion of strings.
80:80 becomes the
20:20 becomes the number 20 to the power of
TOML is an alternative that tries to fix several issues, but it isn’t perfect.
If you stick to yaml: quote more than you do now!
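One well-known source of such surprises is the base-60 ("sexagesimal") integer notation in YAML 1.1: an unquoted scalar made of colon-separated digit groups is read as a number if every group after the first is below 60. The sketch below re-implements just that one rule in plain Python to show the arithmetic. The function name is made up, it is not a YAML parser, and it is not taken from the talk. Real parsers that follow YAML 1.1 apply many more rules.

```python
import re

# Base-60 integer pattern from the YAML 1.1 "int" type definition:
# every group after the first must be in the range 0-59.
SEXAGESIMAL_INT = re.compile(r"^[-+]?[1-9][0-9_]*(:[0-5]?[0-9])+$")


def resolve_yaml11_scalar(text):
    """Return an int if `text` is a YAML 1.1 base-60 integer, else the string."""
    if not SEXAGESIMAL_INT.match(text):
        return text
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("+-").replace("_", "")
    value = 0
    for group in digits.split(":"):
        # Each colon shifts the value so far one base-60 position to the left.
        value = value * 60 + int(group)
    return sign * value


if __name__ == "__main__":
    for scalar in ("20:20", "22:22", "80:80"):
        print(repr(scalar), "->", repr(resolve_yaml11_scalar(scalar)))
```

Here `20:20` turns into the number 1220 (20 × 60 + 20), while `80:80` stays a string because 80 is not a valid base-60 digit group. Quoting the value (`"20:20"`) avoids the conversion, which is the point of the advice above.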
He works for a company that helps other companies transition to devops. Most of his clients are mainly interested in tools. “Just install jenkins and docker for us and we’re a devops company”. They’re not interested in “you need to do something about your culture”.
How to improve that? He started thinking about his design background.
Design is all about understanding user needs and creating something that solves the need. Good design also makes things work better. And makes it work nicer.
Design can also reframe problems. Good designers:
… work directly with actual users. They don’t listen to “product owners” or so.
… good designers welcome ambiguity.
… give form to ideas. They don’t just talk about it, but they also try to build it. Whether a physical model (“a ford mustang 1:1 made from clay by the actual designers”) or an electronic prototype.
… co-create in a safe setting. You start welcoming critiques when you’re in a “design studio”, for instance.
… experiment and revise. Everything well-designed today has been improved slowly and incrementally!
He explained the “design thinking” process in one of his devops consultancy projects. They talked to the actual users. They did collaborative process map workshops (think: walls full of post-its), instead of thick PDFs and big Visio diagrams.
“Challenge mapping” workshops. Looking at all the brainstormed solutions. Some are really elaborate. But sometimes there are solutions that are so simple and robust that they’re immediately implemented right after the workshop…
Diary studies: just give some employees a diary to write up all they’re doing for a few days. Instead of having the whole company fill out corporate timesheets.
Two take-aways he wants us to remember:
Mindset matters more than background. They did this process with developers that were really not designers. But with a little bit of training they could start using design methods.
Stop and listen. Listen. Really listen. Not because they’re your boss or customer or colleague, but because they’re a human being.
A story on success and imposter syndrome.
He has been working for 15 years now and he still doesn’t know what he’s doing :-) He started his career by following some 30 courses. His boss asked him to give a talk about doing those courses, which was well-received.
He stumbled along in his career in that way. He became CTO of a company and had to give a talk on the company’s technical strategy for the next five years. He had only worked there for a few weeks… It went OK.
Almost everybody feels like an imposter sometimes. And the imposter syndrome doesn’t go away with more success, either. He used Mike Cannon-Brookes as an example, who started Atlassian. Oh, Mike is also said to be a world expert on solar energy. The reason? He sent a tweet to Elon Musk and got him to do something… That was all…
“Pluralistic ignorance”. Doubting yourself. But everyone doubts. But nobody says it…
You might have an “I am crap” filter through which you filter all external inputs, even the positive ones. Basically you’re programming yourself to be negative about yourself (“NLP”, neuro-linguistic programming).
Say to yourself, over and over, two things that are positive about yourself. Especially when you start doubting yourself. Do it in front of the mirror if necessary. This way you can re-program your faulty NLP programming.
Don’t compare yourself to others. Especially not to strangers on the internet.
Compliments. Giving a compliment is hard. Receiving a compliment is even harder. Train it. Keep track of the compliments you receive.
Tip: watch the TED talk about the procrastinator’s brain. You can blame the “instant gratification monkey” in your brain instead of the “rational decisionmaker” in your brain.
Learn how others make their sausage. And let others learn from you. That gives you a good perspective on how good or bad others and yourself are. Some things you can do:
Pair programming. How often have you REALLY worked with someone on a deep level?
Pair review (not: peer review, pair review is better!)
Celebrate your failure.
If you have imposter syndrome:
Participate in conversations at social events. Don’t leave right after the conference because you are afraid of
Do public speaking. Start small.
Have an opinion.
Talk about non-work-related stuff. About your hobby. So: have a hobby. Make mistakes in your hobby. Own your mistakes. It is not your work, so it is totally fine to break something or to burn your food or to make a mistake. And it is great to talk about at social events!
A hobby: something you can f*ck up without consequences.
How to identify your limiting beliefs: for that he has another TED talk tip. Vulnerability is our most accurate measurement of courage.
She had some comments at the start. “The only good diff is a red diff”. “Junior engineers ship features, senior engineers kill features.” :-)
There are three main elements of observability: metrics/monitoring, logs, tracing. Logs are strings, mostly, so mostly useless. Monitoring is lots of data, but it is mostly out of context.
“What got you here won’t get you there”. She mostly means developing and deploying code. Releasing and deploying used to be “flipping a switch”. Now it starts to be more continual. Many more releases (and some rollbacks). And “getting it into production” should be embedded throughout the process. It should move earlier right into the process of writing the code.
What are the markers of a healthy engineering team? A good measurement is “how quickly does your code end up in production?”
If you can get stuff to work with a regular old-fashioned LAMP stack: please do. Monitoring is easy. Many companies have much more elaborate and complex systems. Observability is really a problem. You basically have to deal with unknown unknowns. If your pager goes you often say “hey, that’s new!”…
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments practically worthless. Lots of things will go wrong in production that are irreproducible in staging. Or they’ll go wrong in one part of your production system and never in the other part…
As a software developer, operational literacy is not nice-to-have, it is essential.
Well, you need observability. Monitoring is not enough. See the monitoring is dead talk (note by Reinout: I hope that is the correct link). Monitoring hasn’t really changed in the last 20 years. It basically can only handle what you can predict beforehand.
Observability is a measure of how well internal states of a system can be observed by looking at it from the outside. So: can you answer new questions by looking at your system’s available outputs? Important point: can you answer those questions without deploying new code?
Complexity is exploding everywhere. Monitoring is intended for a predictable world.
Testing in production: do it. Everyone does it. The bad ones just don’t admit it. She mentioned something about senior engineers: you trust their instincts. If a senior dev says “I have a bad feeling about this”, you stop and investigate. So you want the senior devs (and all others) to hone their experience on the actual production systems. If you “train” them on the staging systems, you’re training them on the wrong data.
Three principles of software ownership:
They who write the code …
… can and should deploy their code …
… and watch it run in production.
You need to develop a good feel for it. Don’t only look at it when it breaks, but observe it regularly. See if it behaves in the way you expect it to. You need experience! “Muscle memory”.
What you want to accomplish is that problems are corrected and bugs are fixed before the customers find out about it. You want to build a well-behaved system. It is not just OPS that should be on call in the middle of the night. DEV should be, too. Only then will you write well-behaved code.
As a dev, you should spend more time observing your real system. Spend less of it in your artificial environment.
Software should be/have:
tested in production.
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.