Reinout van Rees’ weblog

Devopsdays 2020: can you keep a secret? - Sarah Polan

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

Three people can keep a secret if two are dead.

Secrets? Private certificate keys, username/password combinations, etc. You need to keep them secret in order to protect assets: personal info, medical info, financial data, system configs, protecting your servers from cryptomining, etc. Logs also need to be protected (otherwise you’re handing intruders a map).

  • What needs to consume the secret? One service? Multiple ones?

  • How is the secret stored? Keepass? Source code?

  • How is the secret rotated? How often?

  • How do you initially generate the secret? How do you keep it secret?

  • How do you keep humans from seeing machine secrets?

  • If you have multiple systems (cloud…), how do you move the secret?

Problems: lots of secrets all over the place, often without a centralised solution. Secrets can be hardcoded in software. Default passwords (Equifax’s admin:admin…).

Luckily, there are tools for that. But first, we need to look at policies. Take an inventory of what you’re doing now. Which secrets do you have? How are they stored? Who has access to what?

Principle of least privilege. Only give the bare minimum of privileges that allows the user to complete the job.

She then did a demo with hashicorp vault and kubernetes. There, too, the principle of least privilege is important. Who needs access to which secrets? Write access? Read access?
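
As an illustration (mine, not from her demo), reading a secret from vault in python with the hvac client library could look like this. The url, token and path are placeholders, and least privilege means the token belongs to a policy that can read this one path and nothing else:

    import hvac

    # Placeholder url/token: in practice the token comes from a policy
    # that only grants read access to the path below.
    client = hvac.Client(url="https://vault.example.com", token="s.xxxxx")

    # Read a kv-v2 secret.
    secret = client.secrets.kv.v2.read_secret_version(path="myapp/db")
    password = secret["data"]["data"]["password"]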

Devopsdays 2020: Five ways to screw up your devops transformation - Dave van Herpen

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

Organisations engage in devops transformations because they make both ourselves and our customers happier. Better products. We ought to become faster, smarter and safer.

Devops gained momentum by adopting and integrating several existing and new practices like lean, agile, cloud, CI/CD, SRE. Success lies in combining these in a healthy balance: people, process, technology.

How to do this in your company depends on your company. But one thing valid for all: we can learn from mistakes. So here are five ways to screw up.

  • Fail #5: keep calling it a devops transformation. Watch out for using such buzzwords. Don’t just throw in “kubernetes” and expect everything to improve. You need to know the reason, the intention behind the term/technology/buzzword.

    Watch out with “labeling”. It is often short-term. You start with agile, then it is called scrum, then it is called devops, etc. Your organisation quickly gets weary of such labels.

  • Fail #4: just gantt it. If you think such a transformation can be done with a nice linear project plan: no.

    What you need to look at: the Cynefin framework. So: stop using complicated tools to deal with complex problems. Cynefin differentiates between complex/complicated/chaotic/obvious.

  • Fail #3: keep calm and let the agile coach take care of it. So: don’t open a can of agile coaches. Agile is only one aspect of devops, so by hiring one kind of coach, you’re neglecting the others. (Note: there are good coaches, but watch out.)

  • Fail #2: stick with fluffy culture stuff. Yes, culture is important. But you shouldn’t look at it at too high a level. If you start monitoring, you want specific useful measurements, not generic graphs. The same with culture.

    Dive deeper with culture, too. Look at specific behaviours like feedback or transparency or prioritisation instead of a lofty high-level goal.

  • Fail #1: allow fake change. Don’t fake devops for management when there’s no real change. Don’t leave the existing organisational structure in place: you have to change it. If the structure stays the same, the results will also stay the same.

    Look at what management is doing. What do they say and what do they do? Does what they do match what they say? The Germans have a nice word for it: Eisenbahnscheinbewegung, a “railway fake movement”: the feeling you can get in a station that your train is moving when it is actually the train next to your window that’s moving.

    Call out fake change when you see it!

Devopsdays 2020: moving from project to product - Edward Pearson

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

The full title of the talk: moving from project to product: it’s time we stopped failing the business. Link to his slides.

Where did it all go wrong? In many organisations, IT is not fit for purpose. IT is failing the business. No: we are failing our business. “Our” is also potentially a problem when work has been outsourced…

In companies, often new features are quite big. Someone gets a good idea, gets management behind it, something is architected, UX is involved, frontend and backend devs get to work. And at the end it has to be put on the servers and maintained. And it has to start providing value to the customer.

Such a project starts to go wrong right at the start: where does the idea come from? Does it start with an actual customer need? Once the project gets underway, as described in the previous paragraph, it looks more like the 20th century scientific way of managing work. There are those that plan and those that do. There is one best way to do any single job. Workers need to be managed and monitored. They need to be trained to do their specific job well.

  • Projects assume we know what the right thing to build is. This means we make key decisions right at the start, exactly when we have the least information.

  • Projects have to be funded. So: a pre-defined solution exists. Teams aren’t really expected to learn and adjust.

  • Project teams are focused on short-term goals. Long-term maintenance is someone else’s problem.

  • Projects focus on output, not customer value. Output: cost, scope, quality, time.

So: projects do not focus on customer value.

The future lies in products, platforms and services. Those are much more focused on customer value. And they often automatically lead to more cross-functional teams.

“Platform as a service” is important as an underpinning. To prevent hand-overs between teams, as much as possible needs to be provided as a service. Hosting platform, continuous delivery platform, monitoring service.

How does the product workflow go?

  • You start with an idea, which ends up on a prioritised backlog. The customer can help with the prioritisation!

  • When it gets out of the backlog, there’s commitment to actually create it.

  • It is created by a cross-functional team. This might involve the customer.

  • The value ends up quickly with the customer.

Top five guiding principles. What needs to change?

  • Batch size needs to get smaller. You can move faster. And you can release more often. If you release more often, the pressure to get a half-baked feature in is lower.

  • Focus. On actual customer value.

  • Scope. It is agreed in advance and based on feedback from the developers and the customers.

  • Funding. Based on actual customer need and customer value.

  • Speed to market. Speed is “as long as it takes”. As fast as possible. What is the minimum we can do to actually provide value?

What do we measure? Output (project) or outcome (product)? Align your people to deliver what your customers value.

Devopsdays 2020: BizDevOps, bridging the dominant divide between business and IT - Henk van der Schuur

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

The demand for IT has increased a lot. At the same time, there’s democratisation of technology: previously you only had professional developers, now there are also what he calls “citizen developers”. And even “consumers” themselves are starting to do a bit of development (IFTTT, for instance).

IT usage has also changed. Way back in the old days it was mostly internally focused: bookkeeping, HR. It is now focused much more outside the company: it delivers value for the customers.

New technologies allow constant changes for businesses. Philips started out as a lightbulb company and now builds medical equipment. Nintendo started as a physical playing card company.

Business and IT are often separate. IT says to business “trust us and give us freedom, that results in value”. Business says back “trust is hard when there’s downtime and when features are late and don’t match what our customers actually value”.

Devops is a bit in the middle between business and IT, in a sense. What can we do? Well, get the right people together in one room (or online channel…) from throughout the organisation. And focus such a meeting on specific actual customer values. We’re in this together.

What you need are bridge builders who can get this going.

Such a meeting or series of meetings can be an intense “pressure cooker” to kick-start a team. “What is the minimum information the inventory manager needs” is the kind of focused question you could ask. It can result in a quick prototype within a week.

  • Preparation: what is it really about? Why are we doing it?

  • Ideate: do we really understand the problem? What are we missing?

  • Sketch: check “is this what you mean?” What should be changed? How does this bring value?

  • Prototype: what are the most important requirements? Who is the audience? What are we going to deliver? What do we leave out?

  • Decide: was value actually delivered? What’s next?

He uses the term BizDevOps. He sees a role for OPS on the way from business to development. Business has a desire, OPS helps define it and IT can start with a good plan. DevOps is empathy between dev and ops. Can we repeat the “trick” with business included?

Devopsdays 2020: securing your devops transformation - April Edwards

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

She quotes Gartner: “95% of cloud breaches occur due to human errors such as configuration mistakes”. Passwords in repositories, for instance.

A common problem is deadlines. It is easy to cut back a bit on security to meet a business deadline or objective… Manual processes and culture also hinder security innovation. Interference from management. Fear of failure. Lack of learning.

Devops: people, process, products. Five main things to pay attention to:

  • Enable secure development. What tools do you use?

  • Secure your development devices and development environment.

  • Make sure your source code control is secure.

  • CI/CD tooling and all the other automatic checks you can do (see the sketch after this list).

  • Good alerting and monitoring.
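
As a sketch of such an automatic check (my own illustration, not from the talk): a deliberately naive scan for hardcoded credentials that a CI pipeline could run on every commit. Real tools like gitleaks or trufflehog are much smarter:

    import re
    import sys
    from pathlib import Path

    # Naive pattern: assignments like password = "..." or api_key = '...'.
    SUSPICIOUS = re.compile(r"(password|secret|api[_-]?key)\s*=\s*['\"]", re.I)

    hits = 0
    for path in Path(".").rglob("*.py"):
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if SUSPICIOUS.search(line):
                print(f"{path}:{lineno}: possible hardcoded secret")
                hits += 1

    sys.exit(1 if hits else 0)  # a non-zero exit fails the build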

Devopsdays 2020: Microsoft’s not-so-easy evolution into the world of DevOps - Abel Wang

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

For devops he uses a definition of Donovan Brown: devops is the union of people, process and products to enable continuous delivery of value to our end users.

Note: delivery of value, not necessarily delivery of code. Delivering code is painful as things can break. But… if it hurts, do it more often. Find what hurts and keep getting a little bit better at it every time. Incremental improvements.

He works at microsoft, so remember they went from shipping boxes with software (every three years or so) to continuous deployment… One of the things they changed was to put their best engineers to work on engineering systems. With those systems, they could then build everything they wanted on top. This became azure devops.

A core idea: use your own systems. Azure devops is not only for outside users, it is also used everywhere internally. So azure devops problems get discovered and fixed quickly. And it improves continually.

One of their problems was the organisation chart. You had different teams for program management, development and testing. So bugs turned into shouting matches between devs and QA. That slowed everything down. Dev and QA got combined into an engineering team with shared responsibility. That took some re-training and some getting used to. They even lost 20% of the employees as they couldn’t handle it or didn’t want to do it. In the end, it was worth it.

They have now moved on to feature teams. Everything from UI to database to deployment is handled by one team. This way you can handle customer requests and interaction better. And you can get everything done more quickly. Feature teams don’t exist forever, so as an engineer you have the chance to try something else from time to time.

They work in sprints of three weeks. Sprints and “quarters” (= 4 sprints) are the responsibility of the teams. This deals with the details. The leadership is responsible for the big picture: “semester” (6 months) and “strategy” (12 months). Leadership is even forbidden from looking at sprint backlogs in order to prevent micro-managing!

They don’t want to incur debt. There is a rule: fix most of the bugs within the sprint. There’s a little bit of leeway in that a team can take “4 × the number of programmers” bugs over to the next sprint. But the number isn’t ever allowed to be higher. This way, the total number of bugs in the whole system stays manageable. It is a huge difference compared to the longer development cycles they had earlier: a huge stack of bugs could accumulate that was near-impossible to chop down to size again.
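
In code form, the rule is simply this (my paraphrase of the talk):

    def bug_cap(number_of_programmers):
        # A team may carry at most 4 bugs per programmer into the next sprint.
        return 4 * number_of_programmers

    def sprint_may_end(open_bugs, number_of_programmers):
        return open_bugs <= bug_cap(number_of_programmers)

    assert sprint_may_end(open_bugs=18, number_of_programmers=5)      # 18 <= 20
    assert not sprint_may_end(open_bugs=25, number_of_programmers=5)  # over the cap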

So: continuous improvements! It took them a long time.

Devopsdays 2020: importance of devops in the new world - Michael Ducy

2020-07-09

Tags: devopsdays

(One of my summaries of a talk at the 2020 online devopsdays Amsterdam conference).

(Note: he woke up at 3:00 at night to give the talk as he’s from the USA).

Corona… social distancing, less travel, canceled conferences, communities shattered when we need them most.

The rise of cloud and mobile has changed the world. Your mobile phone has access to everything. There’s the “rise of the connected consumer”: businesses have changed a lot, too. Digital transformation. DevOps enables it.

DevOps is a lot of things. But mostly it is empathy. So community is important (and sadly mostly lacking at the moment). Digital transformation is about other things: capitalism (not necessarily intended negatively), consumption, individuality.

Individuality? In the corona response you see a mostly individualistic approach in the USA and much more community approaches in countries where they have it more or less under control.

So: what does digital transformation do to serve the people? What kind of world do we want to build? What is our focus? What does digital transformation do to serve the underprivileged and the poor? It is the kind of thing you see when you need to get an unemployment benefit system from 100 “customers” per day to 180,000 per day…

Looking at the real world with a technical eye, our society is a legacy system that desperately needs upgrades. At least in the USA (where he’s from) parts of the system are 400 years old, dating back to when there was still slavery.

What we need to focus on is bettering the lot of those in our society that have the least. A rising tide lifts all boats. And at least in the USA, more and more money is being poured into security as opposed to welfare: that’s not the best way. We must destroy our, sometimes very old, systems of (racial) oppression.

Rotterdam python meetup

2020-03-05

Tags: python, django

Microservices with Python for AI in radiology - Coert Metz

In radiology, it takes people a long time to become experienced. Medical school, MD, certified radiologist… And when they’re 68, they retire. What they did at Quantib was to try and “scale radiology experience with AI”.

Detection and classification of prostate lesions. Same with breast MRIs. Brain shrinkage. They hope it increases the number of MRI scans that can be processed, and also the quality of the analysis.

He demoed the application. There’s detection of brain regions in the software, for instance. When you compare two MRI scans at different points in time, you can see the difference and compare that difference with what you would see in a healthy person.

Hospital practice often means downloading radiology MRI images from a central hospital image storage server (“PACS”), taking them to a separate workstation for analysis and then going back with reports. This takes time, so it is sometimes omitted due to time pressure…

What they’re working on now is to run their AI software on a server and connect it to the image storage service. They designed their software as a bunch of microservices. Storage service, import, dispatch, workflow service, processing.

Nice idea: you can add exporter plugins to the system by means of docker containers.

Why microservices?

  • Better scalability. AI on GPU nodes can be expensive. So it is more cost effective to only scale the AI services on those nodes and use regular nodes for the rest.

  • Cloud-ready.

  • It is easier to reason about a separate service in isolation. Failure modes and security are easier to figure out. And, important for a hospital, regulatory requirements are better manageable: risk management, cybersecurity.

  • Of course, testing in isolation is easier.

Microservices are a bit harder to set up than a monolith. Especially when a large part of the team isn’t really experienced with devops type of work.

The core services and the front end are done with python and django. The services also mostly use django restframework. All the communication between the services is done with REST APIs. Extensions also talk to the APIs. Django restframework is mostly straightforward to use.

When designing an API, make it a nice, clean, clear, consistent REST API. Follow REST good practices. Plural nouns (workflows, not workflow). Use HTTP verbs (get/put/post/delete). If resources are nested, also nest them in the URLs. A puzzle: using the right HTTP status codes. There are nice decision trees available for that online. Don’t compromise!
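
To illustrate those conventions with django restframework (my own sketch; the viewset name is made up):

    from rest_framework import routers

    from myapp.views import WorkflowViewSet  # hypothetical viewset

    router = routers.SimpleRouter()
    # Plural noun in the URL; the router maps the HTTP verbs to
    # list/create/retrieve/update/destroy actions for you.
    router.register(r"workflows", WorkflowViewSet)
    urlpatterns = router.urls

    # Nested resources get nested URLs, e.g. /workflows/{id}/steps/,
    # via hand-written URL patterns or a helper like drf-nested-routers.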

The front-end consists of a separate django app that communicates with the back-end microservices. The user interface is done in javascript.

Testing: regular unittests plus django’s test cases. Javascript: jest (they have about 90% coverage). For integration testing they use PACT-python (consumer driven contracts). It is all done automatically on the continuous integration server. Getting the integration tests to work well was a challenge, btw. What helped was to return only minimal responses when mocking.
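
A consumer-driven contract with pact-python looks roughly like this (my sketch; the service and path names are made up). Note how the mocked response is kept minimal, which is what they said helped:

    import atexit

    import requests
    from pact import Consumer, Provider

    pact = Consumer("frontend").has_pact_with(Provider("workflow-service"))
    pact.start_service()
    atexit.register(pact.stop_service)

    (pact
     .given("a workflow exists")
     .upon_receiving("a request for the workflow list")
     .with_request("GET", "/workflows/")
     .will_respond_with(200, body=[{"id": 1}]))  # minimal response

    with pact:
        assert requests.get(pact.uri + "/workflows/").json() == [{"id": 1}]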

Deployment: docker in swarm mode (they’ll move to kubernetes later). Docker secrets. Gunicorn+nginx. TLS everywhere: mutual TLS between the services, regular one-way TLS between the browser and the front-end service.

Home Automation and python - Bas Nijholt

Bas likes using programming in his life. For instance home automation: https://github.com/basnijholt/home-assistant-config

He didn’t care about home automation until he found a way to do it in python (home assistant) and he had a good use case. The use case was the elaborate video/audio system of a family member they were moving in with. It should not take six different buttons to finally get the TV running. Time to automate it.

Home automation is an expensive and time consuming hobby (“if it doesn’t cost time and if it doesn’t cost money, it is no hobby”). Some of his examples:

  • Changing lights. Turning heating on or off.

  • When you go to the bathroom at night after you’ve been sleeping, don’t turn on the bright light in the toilet, but use a soft red light.

  • Controlling the robot vacuum cleaner to only do its work when everyone is out of the house.

  • A smart current meter connected to the washing machine that sends a message to your phone when it is ready.

  • A packet sniffer between the regular thermostat and the heater to intercept and control it.

  • A humidity sensor in the bathroom to detect when you’re showering: the lights should then stay on despite there being almost no movement :-)

Home automation should be fun and (mostly) useful. It should not invade your privacy or complicate your life.

Regarding complication, two things to keep in mind from the python philosophy:

  • If the implementation is hard to explain, it is a bad idea.

  • If the implementation is easy to explain, it may be a good idea.

So: home assistant. The big problem that it solves is that it ties everything together: all the various protocols (wifi, bluetooth, infrared, etc), all the various devices (temperature, humidity, switches, cameras, sockets, etc) and all the various companies… It is written in python. You have abstract “Device” classes that can be subclassed. And there are lots of examples.
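
Such a subclass can be quite small. A loose sketch (home assistant’s exact entity API differs per version, so treat this as illustrative):

    from homeassistant.components.sensor import SensorEntity

    class BathroomHumiditySensor(SensorEntity):
        """Made-up example; a real integration does hardware I/O."""

        _attr_name = "Bathroom humidity"
        _attr_native_unit_of_measurement = "%"

        def update(self):
            # Poll the actual sensor here; hardcoded for the sketch.
            self._attr_native_value = 55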

It is open source. Really open source, as it is in the top 10 github projects when you look at the number of contributors. There are lots of active developers. There are even four full time developers paid for by home assistant users!

He then showed his dashboard… A list of plants with their humidity level, for instance. Energy usage. Which lights were on or off. He sent his robot vacuum to a certain room through the web interface. He also showed a video he recorded: nice!

To start with, a raspberry pi and some sensors is enough. Probably you already have a few devices at home that you can connect.

Detecting outages at scale - Sander van de Graaf

Sander works at down detector: a service that detects when something is down. They monitor loads of services (facebook, etc). Often they notice an outage earlier than the affected service itself.

They make most of their money from enterprise subscriptions: companies use it to monitor their own services and also the services they in turn depend on.

They’re using python and django and started in 2012. They initially used python-nltk to analyse scraped twitter messages to determine whether there was an outage for a certain service.

They started on physical servers (which he hates, as they tend to die sometimes), then moved to AWS and they’re now using serverless a lot. For serverless they switched parts from django to flask. Django is now used for database migrations and the admin, mostly.

Basically: async everything. A server creates jobs in redis, workers get jobs. A separate service monitors the queue size and increases and decreases the number of workers.

They use python RQ, “easy job queues for python”, which works with redis. He is really enthusiastic about it. It is really simple to use.
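
The basic RQ pattern is indeed pleasantly simple (a sketch; check_service is a made-up job function that the workers must be able to import):

    from redis import Redis
    from rq import Queue

    from mymonitoring.jobs import check_service  # hypothetical module

    queue = Queue(connection=Redis())
    queue.enqueue(check_service, "facebook")
    # A worker process started with `rq worker` picks the job up.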

He then explained their setup, which uses loads of amazon services. A question from the audience was “don’t you have extreme lock-in this way?” His answer was: “if you use the cloud, go all-in”. If you can only use a small subset because you might want to move to a different cloud provider, you’re missing out on lots of stuff. You ought to just use a regular virtual server, then. Much cheaper. If you have the money to use the cloud, go all in. Use all the nice tools and all the managed services.

What they also like: python’s @lru_cache cache decorator. Also: “black” for code formatting. Flask. Pipenv. https://codecov.io. statsd. Grafanacloud.

Personal Github projects - Ambar

He quickly showed some personal projects at https://github.com/ambardas .

Based on the book “deep work”, he wrote https://github.com/ambardas/make_my_day_planner to re-schedule his google calendar a bit.

In between he showed how to use pytest, doctests and coverage. And github actions to automatically run it on github. Note: quite a lot of audience members mentioned that they like github actions, especially the speed.
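
A doctest, for instance, is just an example in a docstring that pytest (with --doctest-modules) can run as a test:

    def double(number):
        """Return twice the number.

        >>> double(4)
        8
        """
        return number * 2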

Fun: https://github.com/ambardas/sorting_performance (currently, look in the development branch). A small project to determine the optimal on-a-table sorting process for supermarket football cards. You can optimize for speed or for you-can-do-it-while-doing-other-things.

See https://visualgo.net/bn/sorting for nice visualisations.

Github basic auth deprecation and jenkins

2020-01-10

Tags: django, python

I have been getting periodic deprecation notice emails from github for the last few months:

Hi @nenskins,

You recently used a password to access an endpoint through the GitHub API using okhttp/2.7.5. We will deprecate basic authentication using password to this endpoint soon:

https://api.github.com/

We recommend using a personal access token (PAT) with the appropriate scope to access this endpoint instead. Visit https://github.com/settings/tokens for more information.

Thanks, The GitHub Team

Hm, that @nenskins user, that is our old jenkins instance talking to github somehow. Apparently through basic auth. Only… where? Most of the github traffic seemed to use just an access token. Jenkins calls that the secret text type. Basic auth is type username with password in jenkins.

What it turned out to be was the github branch source plugin. This periodically looks at our github organisation to see if there are new projects or new branches that it missed. Normally github tells our jenkins when there’s a new project or pull request or so.

Ok, on to the jenkins settings for my organisation. The confusing thing here is that the “credentials” setting says this:

Note that only "username with password" credentials are
supported. Existing credentials of other kinds will be filtered out. This
is because jenkins exercises GitHub API, and this last one does not
support other ways of authentication.

Huh? Github is refusing user/password basic auth, which is the only kind of credential this plugin supports? I updated every plugin, but the problem still persisted.

I only got it after reading this bug report and especially this comment:

Isn’t that message saying that you can continue to use basic auth so long as instead of using your actual password you use a personal access token. Generate a personal access token from the GitHub “Settings” page and store that personal access token in the Jenkins username / password credential as the password. Place your username as the username. Check that it works. It has been working that way for me.

Ah! So “github is refusing user/password basic auth” really means just that: github is refusing basic auth with your actual password. Using an access token instead of your password is actually fine.
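
In python requests terms, the fix amounts to this (a sketch; the token is of course a placeholder):

    import requests

    # Basic auth still works: put the personal access token in the
    # password slot instead of your actual password.
    response = requests.get(
        "https://api.github.com/user/repos",
        auth=("nenskins", "<personal-access-token>"),
    )
    response.raise_for_status()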

The info in jenkins on those credentials actually mentions that somewhat:

If your organization contains private repositories, then you need to
specify a credential from an user who have access to those
repositories. This is done by creating a "username with password"
credential where the password is GitHub personal access tokens. The
necessary scope is "repo".

So I visited https://github.com/settings/tokens and generated a new token with full “repo” rights (this is actually quite restricted in scope, despite the name).

In Jenkins I added a new global username/password credential with the github username + the access token and hurray, everything worked again.

October Kubernetes meetup (NL)

2019-10-03

Tags: kubernetes

I attended the kubernetes meetup in Amsterdam on 2019-10-02. Here are my summaries of the talks :-)

Cloud native storage and StorageOS - Alex Chircop

Alex is both a founder of StorageOS and a co-chair of the CNCF storage SIG. So he’s got two hats. More details on the SIG: https://github.com/cncf/sig-storage

Why is storage important? Well, there’s no such thing as a stateless architecture, in the end: something needs to be stored somewhere. Containers are nicely portable, but if the storage they need isn’t portable… That’s why it is important.

The SIG wrote a summary of the storage landscape: https://github.com/cncf/sig-storage . Normally, you had to work with whatever storage your company’s IT department was using. Now developers get a say in it.

Storage has attributes: availability, performance, scalability, consistency, durability. But they can mean different things to different people. Performance might mean “throughput” but also “latency”, for instance.

You can categorize storage solutions: hardware, software, cloud services. “Software” then means “software-defined storage on commodity hardware” and often tries to “scale out”. “Hardware” is much more specialized and tries to “scale up”.

Another categorization: access via volumes (blocks, filesystem) and access via an API (like object stores). Kubernetes mostly deals with the volumes kind.

Data access: file system, block, object store. All of them are better/worse suited for different tasks. You won’t use an object store for low-latency work, for instance.

A big differentiator: storage topology. Centralised, distributed, sharded, hyperconverged. “Centralised” often means proprietary hardware. “Distributed” often uses a shared-nothing architecture with regular hardware. “Sharded” is often good at spreading your load, but it can be very tricky to get right. “Hyperconverged” means that nodes are used for both storage and computing.

Another aspect: data protection. RAID and mirrors for local disks. Or replicas of entire nodes. Erasure coding: quite extreme distribution, that’s why amazon’s s3 can claim eleven 9’s of durability.

Kubernetes has the CRI (container runtime interface) and the CNI (container network interface). It now also has the CSI: the container storage interface. Kubernetes is a container orchestration solution, so it really also needs to talk to the storage layer.

How k8s progressed:

  • K8S native drivers: hard to debug and update.

  • Docker volume driver interface.

  • K8S flex volumes, the first outside-of-the-core solution. It still works.

  • CSI, container storage interface. 1.0 was released in 2018, it is now the standard.

Now the second part of the presentation: storageOS. “Software defined cloud native storage”. It is a containerised project, so there are no other dependencies.

It consists of two parts. The control plane manages the actual storage. The data plane manages the volumes (both block and file system).

It is normally deployed as a single light-weight container on every individual node (via a daemonset, for instance). Every container has an API. One of the integrations available for it is k8s’ CSI.

StorageOS creates a pool of storage that spans the entire cluster. An admin will configure/register storage classes. Developers put “volume claims” in their k8s.
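
Creating such a volume claim with the official kubernetes python client could look like this (my sketch; the “storageos” class name and the size are placeholders):

    from kubernetes import client, config

    config.load_kube_config()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="demo-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="storageos",  # placeholder class name
            resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)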

As soon as you get a volume in the storage pool, it is available on any node in the entire cluster. This gives you lots of flexibility in moving containers between nodes.

Behind the scenes, it uses synchronous replication between a primary volume and a user-defined number of replicas to protect data from disk or node failure. Nodes can have different numbers/sizes of disks.

They’ve tried to make storageOS usable for a “hyperconverged” environment where every node is used for both storage and computing. StorageOS will run quite happily on a single CPU and a GB of RAM.

Most people will manage storageOS via k8s, but you can also use the command line or a GUI. For monitoring, they provide lots of prometheus end points.

Some extra features:

  • Locality, you can get the workload to run on the node where the data is.

  • There’s encryption at rest. Keys are stored as kubernetes secrets. The advantage is that you have your keys, instead of your cloud provider having the keys to your data.

Monitoring Kubernetes clusters in Google Kubernetes Engine - Sergey Parhomenko

Sergey works at everon/evbox (https://evbox.com), the host of the meeting.

They knew from day one that they had to run in the cloud, so they were lucky to be cloud-native from the start. They chose Google’s cloud platform then. And in general, it has been working fine for them.

They had a small team originally and didn’t want to “waste” time on infrastructure. They started using Google App Engine. Google at that time used the marketing term “NoOps”, which sounded fine to them :-)

When they switched to kubernetes, it took seven months. That was a bit long. They tried to get buy-in for the process by involving lots of people from most teams. This wasn’t such a good idea (making decisions took a lot of time); it would have been better to do it with a smaller ad-hoc team. Another reason for the slow switch was that the company was growing a lot at that time: they needed to get the new developers up to speed at the same time.

Another problem: slow development environments. They used Docker Desktop. That used 25% CPU when idle. Kubernetes just isn’t designed to run on a laptop. (Note: there were some other suggestions, like minikube, from the audience)

A third problem: cluster configuration. Configuring anything within a kubernetes cluster works fine. But once you have to interact with something in the outside world (like some IP ranges), you can run into problems.

Some lessons learned:

  • Try it with one product first. Only then move on to the rest of your products. You have some initial pain because you have to maintain two infrastructures, but it is worth it.

  • Spread the knowledge, but focus. Don’t let knowledge-spreading hold your migration back.

  • Set a scope by prioritizing. Application servers; configuration/scheduling/service mesh; messaging/storage.

  • Use terraform!

  • Know the cost of a configuration change.

  • Know if cloud-agnostic is important for you.

  • Monitoring is important. The rest of the talk is about monitoring.

Monitoring. There’s a lot! Zabbix, prometheus, splunk, nagios, datadog, graphite, etc.

A book he suggests: the art of monitoring. From the same author there’s also “monitoring with prometheus”.

Monitoring: there are lots of sources. Your code, libraries, servers, the OS, your infrastructure, services from your cloud provider, external services, etc. And there are many destinations: storage, visualisation, alerting, diagnostics, automation, etc.

So: make an inventory of what you want to monitor and how you want to use it.

In kubernetes, you additionally want to monitor containers, pods, nodes and your cluster. There are some extra sources, too: kubelet, the scheduler and the proxy. Interestingly, there are also more destinations: the scheduler (they’re not yet so advanced that they need to customise it), autoscalers (they’re using this), the dashboard and kubectl top.

Note: there is no built-in monitoring data storage solution in kubernetes. You’ll need to use something else for that (like prometheus).

What you need to design is a monitoring pipeline:

  • Sources.

  • Agents.

  • Collectors (push/pull).

  • Routers.

  • Proxies.

  • Destinations.

Some public clouds have their own default monitoring solution. With google, you get “stackdriver”. Amazon: cloudwatch. Azure: monitor. It is relatively cheap and it is preconfigured for the tooling you’re using.

If you don’t want to use such a specific monitoring stack… and if you want an OSS stack… Very common: prometheus (https://prometheus.io/). And for visualisation, grafana.

Prometheus itself is just a monitoring gatherer/forwarder, but there are several other projects under its umbrella, like TSDB for storing the monitoring data. There’s also an alert manager. There’s no visualisation, but you can use grafana for that. Prometheus uses a pull model, so you need to provide metrics via endpoints for it to collect. If you need to push metrics, you can configure a “pushgateway” to work around this.
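
Providing such an endpoint from python is only a few lines with the prometheus_client library (a sketch; the metric name is made up):

    from prometheus_client import Counter, start_http_server

    REQUESTS = Counter("myapp_requests_total", "Requests handled")

    # Prometheus pulls from http://this-host:8000/metrics
    start_http_server(8000)

    def handle_request():
        REQUESTS.inc()  # increment on every handled request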

For OSS, you can also look at InfluxData (InfluxDB, telegraf, chronograf, kapacitor).

Open source stacks: they’re cheap. Cloud-agnostic. Highly customizable. A healthy ecosystem. There is still a bit of competition in this area: graphite, ELK, zabbix/nagios.

And… there are loads of commercial solutions that promise to solve all your monitoring problems. For instance Datadog. Datadog inside kubernetes means installing an agent container on every node. Once the metrics are collected, datadog handles everything else for you.

Commercial solutions: they cost you a lot of money. But they’re often quick to configure! So if you have the money to spend, you can get up and running with pretty good monitoring real quick. There’s lots of competition in this area. Lots of companies offering this kind of service.

There was a question about logging. He answered that google’s stackdriver is working quite OK here. If they move to OSS, they’ll probably use prometheus for monitoring and an ELK stack for logging. Doing the monitoring inside ELK, too, wouldn’t give you good monitoring, he thinks.

Lightning talk by Ruben Visser

Kubernetes 1.16: watch out, some APIs have been deprecated. When he deployed a new cluster (for a training) two days after 1.16 came out, with infrastructure as code, his code broke down, because Helm and all the Helm charts used were broken… He flies close to the sun by always directly using the latest of the latest, but be aware that the change to 1.16 can be somewhat more bothersome.

Something to look at: Octant, made by vmware. It is a bit like the kubernetes dashboard, but it works on the client (it uses your kubectl config file). It visualizes ‘kubectl’. https://github.com/vmware-tanzu/octant

Kapp (https://get-kapp.io/). It is part of https://k14s.io/, “kubernetes tools that follow the unix philosophy to be simple and composable”. Kapp is a bit comparable to ansible, especially in its output. It is a simple deployment tool, focused on the concept of a “kubernetes application”.

 