Reinout van Rees’ weblog

Headaches migrating an old postgis database server

2023-01-05

Tags: nelenschuurmans, python

Years ago, a database server was installed. Ubuntu 14.04 with postgres 9.3 and postgis 2.1, the standard stuff for 14.04. Over the years, some 450 databases were created by our consultants for various geo-water-related projects.

14.04, so it should have been updated at least four years ago, but we never got around to doing it. It was “only” an internal server, so no priority. Last week, we created a new server and wanted to do a simple pg_dumpall on the old and a psql < all.dump on the new one. Nothing fancy.

It was a big 250 GB sql dump file. Ideally, you’d remove unused databases beforehand. But how? It has been in use for eight years. Cleaning up isn’t our company’s biggest strength. We did send out an email “what can be removed?” but almost nobody answers.

  • Colleagues who made databases seven years ago and don’t work at the company don’t even receive the email :-)

  • If you don’t answer, the database is probably just left alone.

  • “It might be useful later”.

Well, so a quick pg_dumpall was done. But the restore failed. Then I got called in to see if I could figure something out. It took a couple of days before I got something working as it was quite hard to figure out what was actually going wrong.

When restoring the sql dump with psql, it would fail halfway like this:

ERROR:  relation "raster.crop" does not exist
STATEMENT:  COPY raster.crop (rid, rast, filename) FROM stdin;
LOG:  invalid message length
LOG:  disconnection: session time: 0:00:34.943 user=postgres database=work_r0146_zutphen_2d host=[local]

That led me to think it was related to some “crop” table in a “raster” schema that was missing. But it was there in the dump just fine (which was of course a bit of a bummer to extract from a 250 GB textfile…). Looking a bit better at the error output, there were quite some errors like that:

CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
ALTER SEQUENCE
psql:/dump/db.out:51310622: ERROR:  type "public.raster" does not exist
LINE 3:     rast public.raster,
                 ^
psql:/dump/db.out:51310625: ERROR:  relation "raster.crop" does not exist
CREATE SEQUENCE
ALTER TABLE
psql:/dump/db.out:51310645: ERROR:  relation "raster.crop" does not exist
psql:/dump/db.out:51310667: ERROR:  type "public.raster" does not exist
LINE 3:     rast public.raster,
                 ^
psql:/dump/db.out:51310670: ERROR:  relation "raster.dem" does not exist
CREATE SEQUENCE
ALTER TABLE
psql:/dump/db.out:51310690: ERROR:  relation "raster.dem" does not exist
psql:/dump/db.out:51310712: ERROR:  type "public.raster" does not exist
LINE 3:     rast public.raster,

So something is wrong with rasters. But loads of errors were also with normal postgis data types.

Problem 1: the new server was installed just with postgres, not with postgis. There were of course no installation instructions for the old server, and my colleague who set up the (manually installed) server didn’t know that we always include postgis.

Ok, installing postgis… now there were way fewer errors, though rasters still didn’t seem to work. I was still baffled by the sudded death halfway of psql:

ERROR:  relation "raster.crop" does not exist
STATEMENT:  COPY raster.crop (rid, rast, filename) FROM stdin;
LOG:  invalid message length
LOG:  disconnection: session time: 0:00:34.943 user=postgres database=work_r0146_zutphen_2d host=[local]

I mean, there were quite some errors, for instance failing row constraints. So why would it fail with a disconnect on this particular error? Let’s look at the logs in /var/logs/postgresql/. Problem 2: empty logfiles????

I spend a while trying to figure it out until I spotted that log_min_messages and log_min_error_statement were set to panic, so effectively off. My colleague did that to keep the main disk from filling up too quickly with a logfile bulging with postgis-is-missing related errors. That was solved in the mean time, so I could set the minimum log level back to a normal value (“warning”).

After crashing again, I looked at the error message in the logfile, which looked a bit different:

psql:/dump/db.out:51646175: ERROR:  relation "raster.crop" does not exist
psql:/dump/db.out:51663992: error: invalid command \.
psql:/dump/db.out:51663999: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
psql:/dump/db.out:51663999: fatal: connection to server was lost

The second line, with invalid command \. was strange. It is the line after a huge number of lines where some ascii-encoded binary geotiff is loaded into a raster. So probably loading the geotiff into the raster fails with invalid message length.

When googling ‘postgres invalid message length’, I basically only found some message from 2003 related to a jdbc driver… When I get so few useful hits, I normally suspect there’s something very basic wrong with what I’m doing. Problem 3: no helpful google results.

The geotiff with the error was the first geotiff getting loaded, so it was probable that loading such rasters was the thing that failed. During lunch, I talked with someone about the problem and he mentioned that there was a breaking change between postgis 2 and 3: previously, the postgis_raster extension was included with the postgis extension, in version 3 you have to enable it manually.

So… The nice create extension postgis statements in the huge sql dump were not enough to also get the postgis_raster extension working. Hence, probably, the error.

In the end, it turned that most of the databases on the database server were more of the “it might be handy someday but it is not strictly necessary” variant. Apparently the source data was often stored somewhere else or could be re-created if needed… I heard this by pure luck during lunch. Ok… so why did I spend so much time on this???

Anyway, I did a dump of just the schemas and grepped a bit to find the databases that used rasters:

$ pg_dumpall --schema-only -v -f /tmp/schemas.dump

There were only three… I deleted them. The alternative was to handle the dumps separately, creating the databases by hand including enabling the postgis_raster extension and then loading the dumps. But it had taken enough time as is and the data had 90% chance of not being used and 99.99% chance of not being desperately needed, so… delete.

With that out of the way, I wanted to make individual dumps with pgdump --format=custom, as the custom format is way smaller than a plain sql dump (and apparently nicer to postgres). Only: pg_dumpall doesn’t support it. So I grabbed a list of the databases and whipped up a quick script:

#!/bin/bash
set -x
# ^^^ This prints every command: nice to monitor the progress.
cd /dump
# ^^^ A temp mount to dump the dumps.
pg_dump --format=custom -f dw_pacific.dump dw_pacific
pg_dump --format=custom -f geul_midden.dump geul_midden
...

Then a similar script to load them all on the other server:

#!/bin/bash
set -x
pg_restore -d postgres --clean --create dw_pacific.dump
pg_restore -d postgres --clean --create geul_midden.dump
...

So…. perhaps this long blog post helps someone to figure out why he’s getting an invalid message length message :-)

Handy tool: pyupgrade

2022-11-30

Tags: python, django, plone

An advantage of open source is that, ideally, you get to work with other programmers from different companies and different backgrounds and different toolchains. They do things differently and so you can pick out the choice bits and start using them yourself.

As an example, z3c.dependencychecker. A small tool I started in 2009 based on someone else’s script. The goal is to look at all the imports in your python code (or apps in django settings files or…) and compare them to the list of requirements in setup.py. What’s missing? What can be removed?

The last few weeks, gforcada did some nice modernization/cleanup. Removing support for older python versions, for instance. And a move from travis-ci to github actions. So I looked at the new action workflow and saw something new. Black, isort, flake8: I know and use them. But pyupgrade? What is that?

That’s how I discovered pyupgrade. Handy tool to upgrade your code to newer python versions. By default, it removes python 2.x stuff that is no longer needed, like # -*- coding: utf-8 -*- at the top of your file. super() calls no longer need to mention the super class. The is not that flake8 tells you to fix. That sort of thing.

pyupgrade --py38-plus upgrades your code to 3.8 (same kind of commands for other versions). So f-strings will be used. assertEqual/assertEquals. Dict comprehensions.

Real handy to quickly clean up your code! Recommended.

Update: Christian Ledermann pointed me at https://github.com/charliermarsh/ruff . Ruff implements a lot of the pyupgrade functionality (and flake8, bugbear, eradicate and lots of the other tools). And it is faaaaaast. I toyed with it in a personal project and it fixed quite some things. The only pyupgrade-related item I missed was the conversion to f-strings. But that will probably be added later.

Django: excel output instead of csv

2022-08-15

Tags: django, plone

Ten years ago I made a time registration system for our company. Yeah, I know. Time registration systems are what everybody writes just for themselves instead of using one of the existing ones. Custom wishes: the same old excuses :-) But it was good fun to build and still is.

At that time it was becoming popular to build the frontend in javascript. So I build it with pure, unashamed django templates. Custom template tags for formatting; formsets; the whole hog.

After ten years, the website is still in heavy use, with only minimal work needed. Well, a bit too minimal: recently I really had to update the site as it was still running on python 3.5 and django 1.8. As it was pretty standard django, updating wasn’t that hard. (I gave a talk about upgrading old django websites with docker recently, which might be interesting).

The old CSV export

Such a time registration system needs to output reports. Lots of different reports. I picked plain *.csv for that, as at the time there wasn’t a handy xls exporter yet. The best one at the time needed you to install openoffice on the webserver, which wasn’t ideal.

Probably due to my zope/plone (python cms) background, I liked subclassing a lot. So I made a base class for csv export:

class CsvResponseMixin(object):

    prepend_lines = []
    header_line = []
    csv_lines = []

    def csv_filename(self):
        .... return filename generated from title or so

    def render_to_response(self, context, **response_kwargs):
        """Return a csv response instead of a rendered template."""
        response = HttpResponse(content_type="text/csv")
        filename = self.csv_filename + ".csv"
        response["Content-Disposition"] = 'attachment; filename="%s"' % filename
        writer = csv.writer(response, delimiter=";")

        for line in self.prepend_lines:
            writer.writerow(line)
        writer.writerow(self.header_line)
        for line in self.csv_lines:
            writer.writerow(line)

        return response

Something like that. Django views that subclassed this only had to have @property methods to return:

  • Optional extra lines at the top for some metadata (the “prepend lines”).

  • The line with column headers.

  • The lines with the actual csv content.

That worked quite well, apart from one problem: excel. And country preferences. Different excel versions (and openoffice/libreoffice, too, for that matter) reacted differently to what I outputted. Whether you installed the Dutch or English version of excel matters in which character is used as a decimal separator. In NL it is a comma, in EN it is a dot. Of course, the two persons that used my website the most both had a different version…

In the end the most reliable variant was to use delimiter=";" and to just output everything with dots for decimals, as is the default. And to tell those two persons to adjust their import settings accordingly.

The new excel export

Like I said, I recently upgraded the website. Ubuntu 18.04 => 22.04, python 3.5 => 3.10, django 1.8 => 3.2. Somewhere in that chain, the output of python’s decimal.Decimal type, that I used for money, changed in the csv output. It used to output 26643.04, but started to output 26643.040000000 instead. I haven’t yet found the changelog entry that explains where the change comes from. If you know, please mail reinout@vanrees.org and I’ll update it here.

The end effect was that excel’s import of the csv (and apple’s Numbers, too) would show the 26643.040000000 as a string instead of as a number. Bah.

With a bit of string formatting, I could have probably fixed it. But I thought there probably was a pretty good xls outputter nowadays: why not make it more comfortable?

Yes: xlsxwriter fit the bill, so I added it to my requirements.

As I already had a base CsvResponseMixin class, it was easy to change every csv output to xls in one go by changing it to an ExcelResponseMixin.. The content type had to change to application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, which is terribly long, but which points at the joys of a bit of standardisation instead of the old lock-in practices.

Python’s csv module has a writer that happily accepts a bunch of .writerow(some_list_of_items), but the xlsx write wants to know the row and column of each cell. Well, there’s an .write_row() method, but that still needs the row number. So it is back to the good old row_number += 1 like I remembered from my Pascal years:

workbook = xlsxwriter.Workbook(response)
worksheet = workbook.add_worksheet()
# See text below, some lines have to be added here.

row_number = 0
for line in self.prepend_lines:
    worksheet.write_row(row_number, 0, line)
    row_number += 1  # yeah, right...
worksheet.write_row(row_number, 0, self.header_line)
row_number += 1
for line in self.excel_lines:
    worksheet.write_row(row_number, 0, line)
    row_number += 1

workbook.close()
return response

This worked like an absolute charm. Only problem was that some of my views returned django model instances. Python’s csv module would just call str() on those, but xlsxwriter would complain that it couldn’t call float() on it. Ok: different default. So I had to configure a mapping from my django model classes to formatter. I added these lines (where I indicated it in the code above):

worksheet.add_write_handler(Group, _django_model_instance_to_string)
worksheet.add_write_handler(Person, _django_model_instance_to_string)
worksheet.add_write_handler(Project, _django_model_instance_to_string)

And the function that does the conversion:

def _django_model_instance_to_string(
        worksheet, row, col, instance, format=None):
    # See https://xlsxwriter.readthedocs.io/working_with_data.html#writing-user-types
    return worksheet.write_string(row, col, str(instance), format)

The advantage of xls export instead of csv

I updated the website with my changes. After 10 minutes I got a success message: “works!”.

With the csv export, I was used to error messages after every tweak. Apparently the xls export is much more robust. So: recommended!

Note: the relevant code (at moment I’m writing this) is available on github.

Addition: I got a message from Chris Adams on github with a note on streaming big responses. With csv, you can do that as it are just lines, but xlsxwriter needs you to have the entire thing ready before you can start returning the response, as an xlsx file is a zipfile. He points at using the “in-memory” mode to speed it up when you have large django querysets.

Rotterdam (NL) June 2022 python meetup summaries

2022-06-16

Tags: python

(Some summaries of a talk at the June 2022 Rotterdam python meetup).

Leveraging Python for secure medical image data transfer to the cloud - Ronald van ‘t Klooster (Quantib)

Usecase: they want to use python to connect a hospital to the cloud to upload radiology images securely. Imagery of patients, so it has to be really secure. And in their case also certified.

Radiology images are normally in the “DICOM” format. There’s a special dicom protocol to transfer images, both for sending and getting images. A computer that implements the protocol is a “dicom node” and can be used by many applications.

The image format is mostly jpeg (actually a set of images: cross-sections of the brain, for instance) plus metadata.

They have an AI program to help diagnose images, with a microservices based architecture using docker+django. It used to run inside the hospital on their network. They wanted to move the AI server to the cloud. But how to get that working while still retaining their certification? This means that the actual certified AI server program cannot be changed.

The solution was to use an open source DIOCM node software called Orthanc. With a nice REST interface, so usable over https. They added one instance inside the network and one in the cloud, next to the AI server.

Dicom node can send/recieve to/from another dicom node, but how to manage it? This is where they used python. A daemon on both sides that monitores the REST api and the local orthanc databases and triggers the necessary “gets” and “sends”. Also the sending/getting from the orthanc node to the actual node where the orthanc is placed next to.

Python helped them a lot. Requests, threading, abstract base classes, logging… everything is there. And with https and client certificates, the security is good. The node on the hospital side only does get and send, so nothing needs to be send to the hospital, as requesting an open port would be very hard :-)

Deployment is with docker in the cloud. In the hospitals, requesting a windows machine is easiest in practice, so they packaged the small orthanc node up as an executable that can be run as a service.

Developing Python apps on Kubernetes - Marcel Koek (Erasmus MC)

They slowly learned how to get a python app to run fine in kubernetes, including getting your local development environment to match the server environment as much as possible. And… making it as comfortable to work with as possible.

He demoed a small app build with fastapi as an example.

When using docker, there’s an extra layer between you and your code. If you change code, you need to rebuild your docker image. Or you need to mount the dir with your code into your running docker. How to manage that? And you also want to make it resemble your production environment.

One step closer to production is to use kubernetes locally. He uses k3d to run a local kubernetes cluster: it runs kubernetes in your local docker daemon! Easy to run. It also includes a local docker registry, which you can push your image to.

For actually deploying the docker, he uses “helm”, kind of a package manager for docker.

What really helps getting development working nice with kubernetes: tilt. Smart rebuilds, live updates, automatic docker rebuilding. His demo “tiltfile” even included an automatic pip install -r requirements if the requirements change.

Note: a tiltfile looks like a python file, but it is a “go” configuration file format…

Build and scale containerized applications on AWS - James Meakin (Radboud UMC)

He’s working on https://grand-challenge.org/ , an open source project for AI and medical imaging. You can submit imagery (“a challenge”) that can then be evaluated with various algorithms. You can test your algorithms this way.

Techs they use: django, celery, vuejs, htmlx, pydicom and many more. And it is containers everywhere with a multi-region deployment on AWS for really low-latency image access from the browser.

The platform is extensible with container images running, for instance, specific algorithms. That’s a security challenge, as they’re written by lots of people. And you’ve got to manage the allowable resources.

In AWs there are lots of ways to run your containers. The three most used ones:

  • AWS app runner. Cheap and easy. Good if your apps have more than 25% idle time.

  • Amazon kubernetes, EKS. For his taste, it is way too complex. “Yaml hell”.

  • Amazon elastic container service, ECS. It is AWS’ opinionated way to run containers at scale. You have much less to understand: it mostly looks and handles like docker-compose.

You can use AWS fargate to run tasks on either EKS or ECS. Their software allows them to use “spot instances”, instances that are cheap because they might get killed.

They use quite some AWS services. Simple queue service, eventbridge, elastic file storage. So they bought into the AWS ecosystem. But, as the actual software is all open source, they can move to other providers with some effort.

Simulating 6-axis industrial robots with Python - Thijs Damsma (CEAD)

He now works for a firm that builds 3d printing robots. 3d stuff that’s up to 40 meter long :-) He’s trying to use python to manage/steer them.

His current experiments are with https://pypi.org/project/roboticstoolbox-python/, which you can use comfortably in a jupyter notebook, including a 3d visualisation. He showed a nice demo.

Edgecase (kubernetes on the edge): afternoon summaries

2022-05-24

Tags: kubernetes, python

(Some summaries of the May 2022 Dutch Edgecase meeting).

Edge: the new data center - Chris Urwin

Kubernetes is appealing on the edge. What every developer wants is a datacenter under their desk. With k8s (and especially k3s) they can.

  • Kubenetes helps a lot by giving you a real, complete environment to manage your workloads. No more custom scripts to monitor whether your docker still runs and restart it if needed. That’s all handled.

  • The challenge about kubernetes: it is big and bloated. The solution for this is k3s which is a full, certified kubernetes distribution but easy and quick to install and maintain.

Edge, what is it? To him, “anything outside of a data center” is edge.

  • Near edge: telco, media, communications. Bigger deploys. Often you see kubernetes installs being offered instead of regular applications.

  • Far edge: retail, banking, education, argriculture, utilities. Separate devices that need managing. K3S shines here (also in combination with “SLE micro”).

K3s is really open source. Build and born in the cloud. The rest of suse Rancher is too. Lightweight cloud-native edge stack, reliable and secure edge inffrastructure. They aim at low maintenance.

That “SLE micro”: OS build for the edge. Enterprise-grade, but especially for edge compute. A lightweight, immutable OS. Build for containers.

Provisioning a Civo private region for edge computing - Dinesh Majrekar

“Turtles all the way down”: they’re using kubernetes to provide kubernetes to customers. They’re trying to use only cloud stuff. They were the first managed k3s server provider. 90 second cluster boot time :-)

They try to be sustainable. In the UK they use biofuel for the emergency generators, for instance. They’re also doing things with “compute powered water heater” solutions: small computers for in individual homes to use exess heat to warm up water.

He showed what they’re doing regarding hardware. Mostly they deal with entire racks. The core is that they want to fully remotely manage such racks. From power-on to fully operational, they now manage it within six hours.

Inspiration usecase: deploying pods to watch pots in greenhouses - Gerrit Tamboer & Sander ter Schure

Dümmen Orange is a company that cross-breeds flowers. It is actually the biggest in the world. They do it in greenhouses world-wide. A greenhouse: a warm and humid environment. What they produce are seedlings for other companies to grow further.

Their main problem: people don’t scale very well. Finding more plant biologists is hard. They’re trying to do more with camera’s and other automatic sensors. The camera images are fed into an artificial intelligence deep learning system that monitor the growth and opening/closing of the flowers. Images taking terabytes of data, so processing in the greenhouse itself is an advantage: at the edge.

Now… how to get the client along on the journey towards kubernetes?

First they did a deployment of the deep learning application to a simple k3s cluster. The application previously took a lot of computing resources on their laptops. Now they could use the kubernetes version without their laptops running hot.

Second, they moved it over to the cloud. Azure in this case as they were already heavily integrated with all things microsoft. One of the things they started using was the azure key vault, as kubernetes’ own solution is kinda basic. And of course they used managed databases and the managed kubernetes service.

Third, they went fully live. In the greenhouses (so: the edge) the customer set up the hardware. Then they used ansible playbooks to deploy k3s, argocd and the deep learning application. A 3 node cluster, btw.

Edgecase (kubernetes on the edge): morning summaries

2022-05-24

Tags: kubernetes, python

(Some summaries of the May 2022 Dutch Edgecase meeting).

Living on the edge - Gerrit Tamboer

You can run kubernetes in a lots of places. Greenhouses, chip making machines, factories, etcetera.

Connectivity. Plugging in a cable is the best case scenario. But perhaps you need VPN. Wifi is challenging. But perhaps also completely airgapped: no direct internet at all.

Security. If you work in the cloud, security is much different from when lots of people could have direct hardware access to your equipment because it is in a factory or so. Are your disks encrypted?

Unpredictability. On the edge, predictability is out the window. The situation can be quite diverse and weird. If you combine that with kubernetes’ complexity: why would you do that? If you combine complexity and unpredictability, you get operational overhead. But that was the situation a few years ago.

Now you have k3s. It changes everything. k3s, whether a node or a master, is a single binary. You can even run single-node clusters. k3s is great and easy. He did a quick demo. A master k3s in amazon and three battery-powered raspberry pi machines spread over the room running k3s.

What they normally use to manage all the sorts of nodes: rancher. Also ArgoCD. The software side is slowly becoming a bit standardised. The hardware is a bit of a problem: basically every customer has different equipment needs. But if you want to productise what you’re offering, you need to standardise a bit more.

Leafcloud: compute sustainability, an edge network for server heat re-use - Thatcher Perkens

We see forest fires, tornados in Germany, melting glaciers, rising sea levels. Can we fix it? No. We can’t undo what has happened and what is already happening. But we can dampen the effect quite a bit if we put in the effort.

Data centers are a big user of energy and thus a big source of global warming. 2.7% of Dutch energy usage is by data centers. Ireland is at 14.4%! Google’s datacenter in the Netherlands is right next door to the biggest coal electricity plant. Sure, most data centers say they use only green energy, but almost nobody gives solid figures… Google is one of the few that publish reasonably useful figures.

One GPU running for a full year has CO2 emissions equal to 23000 km by car. A 8 CPU kubernetes node some 5000 km. Both calculated with the Dutch mix of electricity sources and including the operation and construction of the data center building.

Some tips what you can do about it:

  • Remove unused or underused servers.

  • Build or write more efficient code.

  • Clever caching.

  • Look at the size of your environment. Can you scale down? Review everything periodically.

  • Automatically stop staging/testing environments.

  • Can you improve utilization by using a queue? Run stuff at night?

  • If you run on the edge: turn on power save mode.

  • If you run in the cloud: pick the right region (from a carbon footprint perspective).

  • Choose where you run: in the cloud our on the edge. Which takes less energy? Which is more efficient?

Tip from the audience: vote with your wallet. Energy costs money, so if you pick the cheaper offerings of cloud providers, you automatically pick the more energy efficient ones.

Cloud is a huge drain on the available energy. But cloud also enabled lots of innovation and progress. At leafcloud, they’re trying to change the design.

Data centers are basically space heaters. Energy goes in, heat goes out. That’s not efficient. At leafcloud they place servers in buildings where they actually can use the heat, like apartment buildings. So… they use the heat of the servers to pre-heat tap water. They make a trade: you provide room and they provide warmth.

So… basically a distributed datacenter. They use glassfiber connections from their central datacenter to the leaf nodes. Storage is in the central location, it is compute that runs on the nodes. If a leaf goes offline, everything gets re-scheduled on different leaf nodes.

ArgoCD on the edge, managing k8s in weird places - Fabian Met

He showed a couple of use cases. Central office/location with remote locations: greenhouses, separate factories, equipment inside a huge factory. Sometimes unstable internet connection, sometimes limited physical access. And you still want to manage everything as easy as possible.

How do we manage clusters? Git, flat yaml files, git-based pipelines, helm charts. “Bah!” The only way, according to him, to run clusters is with a “gitops engine” like ArgoCD.

If you have your setup in git: nice. But how do you deploy it? Can there be manual changes? Do you give developers access to kubernetes so that they can play with their namespace? In that case the state isn’t necessarily in git, as it can be changed by hand/kubectl.

What’s better is something like ArgoCD. It will monitor your cluster and compare it with the desired state (in git). And it will roll back changes! So using a strict ArgoCD tool really helps to keep everything nice and clean.

So: no kubectl access to your cluster. For anyone. Only argocd. You can kill off your production cluster and re-create it. That should be possible.

Ok, ArgoCD… Now how do we manage k8s clusters in remote locations when the location has limited connection? A central argocd server isn’t the handiest in that case. But… argocd can maintain itself. It needs a yml with config to start itself. So you can have an argocd node in every location. It needs a git server (which you can run locally) for its yaml config. A quick cronjob to pull config from a central git location.

Observability: prometheus. Prometheus writes to a “live file”. After two hours, it starts a new file. The older files aren’t changed anymore. And thus they can be shipped (=cronjob) to the central location. Use Thanos to read those files as if it is a real prometheus server.

He’s sorry for those poor souls that have to use the ELK stack for logging. Grafana Loki is way nicer. Like Prometheus it writes the logging to a “live file” and rolls over every two hours. The archive files can again be shippped to a centralized location as soon as there’s an internet connection.

Protect data at the edge - Timothy Dewin

He works at veeam, originally a virtual machine backup company. They now also have a kubernetes offering (k10, https://kasten.io).

An example: dredging companies. Their ships come home into the harbor only every few years, that’s the only time they can replace the entire IT environment on the ship. Once on location, often a very remote location, the internet connectivity is really bad. 64kb/s, that kind of stuff. So backups often have to be done locally, on the edge.

Why backups? A backup is like insurance for your data center. Hardware can get broken. Software can get broken. But that’s not the biggest problem nowadays: ransomware is what they see most. What also occurs: rogue employees that delete data.

Kubernetes opened up a big “backup gap”. The devs run stuff on the cluster, but don’t know how the backup works. OPS needs to back it up, but doesn’t know what is running on the cluster.

Kubernetes offers nice high availability, but it is no backup. If the server location burns down, everything is gone. It also doesn’t help with non-node-failure events like data corruption or accidental deletion or ransomware. And it is difficult to have truly offsite backups. Recovery is also often complex.

Their “k10” backup solution is actually running inside the cluster as microservices. Protection/backup is at the namespace level. They back up not only the persistent volumes, but also the configuration of everything that’s running (as it might have changed compared to the git-based config files that you’re depending on).

He showed a demo (https://github.com/tdewin/stock-demo). One of the technologies that he used was ZFS and the “openebs” zfs storage provider. ZFS can do snapshots, that’s one of the advantages. At the end of the demo he deleted his namespace and then restored everything from backup.

Amsterdam python meetup

2022-04-06

Tags: python, django

(Summaries of a talk at the April 2022 py.amsterdam meetup).

Interesting takeaways from book ‘Test Driven Development With Python’ - Rok Klancar

The book “test driven development with python” is available for free onine at https://www.obeythetestinggoat.com/ . The subtitle is test driven development (TDD) for the Web, with Python, Selenium, Django, JavaScript and pals…

It is a quite thick book, but the author is very enthousiastic about programming and the subject of testing, so it actually very readable.

The book consists of three parts:

  • The basics of test driven development and django. So he starts out with explaining web development with django, just enough for having something real to test.

  • Web development sine qua non. He’s serious, at the end of part two you’ll actually have deployed a web application. (And you’ll have tested it, of course.)

  • More advanced topics in testing.

The core idea of test driven development is that you first write a (failing!) test and only then you write the code to get the tests to pass. The first test in the book is whether you have installed django (which you haven’t yet). That sets the tone for the rest of the book :-)

The whole time you’re writing the minimal code needed to get the tests to pass.

Different kind of tests are explained. Functional tests (browser tests in this case, written from the point of view of a user). Unit tests (much finer-grained, written from the point of view of the programmer).

The book is friendly for the reader. Also for a beginner. For instance, it even explains how to read a traceback.

An important concept: refactoring. Either refactoring/changing the code or the tests. Never do them at the same time! And only change code when all the tests are passing. Otherwise you’re like the refactoring cat. A cat jumping into a bath tub and totally panicking: that happens when the code is a mess, without passing tests. It is hard to work in an environment like that. But if you have running tests, it suddenly feels safe to change your code.

The second part even teaches you how to deploy your code to the server in an explicit way (with “fabric”: the appendices have an example with “ansible”). Also in the second part: validating your input.

Part three (“advanced topics”) deals with authentication, for instance. Exploratory code is explained: quick code without tests to explore something. And then of course, adding tests for it and getting it in shape. Also in this part: “mocking”, “continuous integration”, etc.

Everybody can learn something for this book. In a sense, it is a bit of a strange book because it explains a lot of different things. There’s also a bit of a patronizing tone in the book, which took some getting used to. But the author says in the preface that he wrote it for the younger version of himself, so that’s ok.

Graph-based stream processing in Python – Katarina Šupe

Katarina works at Memgraph, a little startup.

A graph is a network structure: a set of nodes (“vertices”) and a set of relationships between them (“edges”). Nodes and relations can have properties (key/value pairs).

Relational databases are what everybody uses. Why would you use a graph database? When is it useful? What are the differences?

  • Graph databases are generally “schemaless”, so you have lots of freedom with your properties.

  • Graph databases have build-in relations. A relational database isn’t necessarily very good/easy/efficient for many-to-many relations.

The most used language to query graph databases is the cypher query language. This is also used in their memgraph database.

Graph analytics (network analysis) generates insichgs hidden in the relationships of the network structure. Pagerank, shortest path, etc. are examples. Recommendation engines also use it. Supply chain risks and fraud detection are business examples.

Stream processing: you start out with (real time) input data, which goes into the “stream processing engine and the output goes on to another item or to an analytics dashboard or so.

They’re making cqlalchemy, a sort of “sqlalchemy for graph databases”. An python OGM (object graph mapper) for graph databases. The idea is that you can write python code instead of a cypher query.

Also interesting: MAGE, Memgraph Advanced Graph Extensions: an open source repo with graph analytics modules.

Lightning talk: managing your laptop - Reinout van Rees

I gave a quick lightning talk about managing your laptop. I got a new laptop yesterday and managed to set it up within a day. I got it working quickly by storing my config in git and also having a readme explaining my setup.

Two very handy tools:

Rotterdam (NL) 2022 python meetup summaries

2022-03-02

Tags: python, django, nelenschuurmans

(Some summaries of a talk at the February 2022 Rotterdam python meetup).

Technical note about running the live+online meetup

The meetup was partly live, partly online. The technical setup worked surprisingly well:

  • They had a microsoft Teams channel for the online folks.

  • A laptop was connected to that same channel and showed it on a big screen via a beamer in the room for the “live” folks.

  • The speakers had to connect to the Teams channel to show their slides both online and automatically also in the room.

  • A big microphone gave pretty good sound, even though it was some four meters away from the speaker.

Worked fine! Strange having a meeting without having to wrestle with hdmi adapters :-)

Pandas and excel tables done properly - Thijs Damsma

He showed https://github.com/VanOord/pandas-xlsx-tables

When using a jupyter notebook and pandas, you can easily load csv files and do stuff with it. Make nice graphs, for instance.

But… colleagues want xsl sheets…. So you can use a pandas xls exporter. But the output is a raw xls sheet. It works much better if you format the data in xls as a “table”: “format as table”. It sounds like it only formats it visually, but it actually figures out the headings and field types. You get proper sorting and so.

So he wrote a new exporter that exports it as a nice formatted table in excel. Much more useful.

Static code analysis and codemods - Matthijs Beekman

Sometimes people change the structure of python packages you depend on. They themselves as a company also have this problem: you want to evolve your internal libraries to improve them, but you also want to keep using it all the time.

There is something called “codemods” for automated code refactoring. Fewer manual changes to your code in response to changed library code. There are two basic ways of doing this:

  • Dynamic checking: basically manually. Running tests, for instance, and looking at the results.

  • Static analysis: parse code and analyse the structure. You don’t run code, but analyse as good as possible. Python type hints help here a lot. You can get a warning “use a DateTime instead of a three-item tuple” out of the static analysis if a function got refactored to use a datetime intead of a year/month/day tuple.

Some static analysis examples: mypy for static type checking, pylint (code linting), bandit (security testing), black (enforces coding standards).

These static analysers often work with the “ast”, python’s build-in abstract syntax tree. There’s also a “cst”, the concrete syntax tree which you can find in “lib2to3” and “libcst”. Libcst has good documentation on what it is.

At his company, they ship “codemods” together with the changed libraries. It doesn’t work for all corner cases, but it works for a surprising amount of cases, actually. They wrote a command line tool that you could tell to run a certain update.

How Python helps to keep the Netherlands dry - Rob van Putten

Rob works in civil engineering.

We have some 18.000 km of levees in the Netherlands. And we really need them. And… we need to assess them regularly! A key ingredient for calculating levee safety is soil information, the geometry of the levee and some extra parameters like expected water level.

  • Soil info is gathered by taking soil measurements. The standard “GEF” ascii files that are the output are of course easily read with python.

  • Levee geometry you can extract from height measurements. There’s really good data in the Netherlands and there are loads of python libraries to work with the raster data.

  • The parameters like river levels can be found in xls files and postgres databases. Again, there are python libraries for it.

Luckily, the standard program that is used for calculating the stability of levees has an api. Again: you can use python.

So… python can help you with a lot of things and help glue everything together.

But… look out for issues like data quality (BS in, BS out). And automatic calculations??? Engineers like to feel in control and don’t always want automation. Also a problem: management at companies that aren’t always very innovation-minded.

Some extra comments:

  • Don’t forget your tests.

  • Don’t forget documentation. Sphinx is great.

  • Python is great for super fast development.

  • Focus is hard. Python is nice, but there’s rust… unreal…. golang… flutter… Focus! Focus!

Upgrading old django websites with docker - Reinout van Rees

I also gave a talk which I’ll try to summarize :-)

At my company (Nelen & Schuurmans) we made a website for the Dutch government (Rijkswaterstaat), some 10 years ago. I helped build it. A website they used to visit all the municipalities along the mayor rivers in the Netherlands. Why? Well, the water levels keep increasing.

  • The website showed, for various scenarios, the height of the water in excess of the current levee/dike height. So a graph of excess height plotted against the length of the river.

  • Either the levee needs to be strengthened and increased in height (which isn’t always desirable, especially near towns)…

  • Or the river needs more room. A bigger floodplain by removing obstacles like disused brick factories near the river. Or moving a levee a bit back. Or re-using an old river arm as extra flood channel.

  • All those measures are pretty mayor civil engineering works, so you need buy-in from the municipalities and the people living there.

  • So the website showed the effect of the various measures. You could select them in the website and watch the graph with the excess height lower itself a little or a lot. That way, you could make clear which measures help a lot and which not.

Lots of measures were taken along the river Meuse (Maas) during the years. And… they were effective. In july 2021 lots of rainfall increased the water level to high levels, but… there were no mayor problems near the Meuse! I was happy to have contributed a bit.

But… on to the topic of the talk. The website was created some ten years ago with the intention of running it for three or four years. “Can we extend it for a year?”, “can we extend it for another year?”, “can we extend it one last time?”, “can we extend it for one really really last time?”. And last year again :-) There were quite some nods from the audience at this time: it sure happens a lot.

So you have an old django website running on an old python version on an old ubuntu server… How to update it? Often the ubuntu version on the server is older than the one on your laptop. You can try to get everything running with a newer ubuntu + newer django + newer python version, but that will lead to quite some frustration.

What’s better: an incremental approach. You can use docker to good effect for this.

  • First phase: pick a docker image matching the old ubuntu (or other linux variant) version on the server.

  • Add all the “apt-get” dependencies you’ve installed for the website.

  • Get the code running there, trying to pin as much as possible of the python dependencies to what’s actually on the server.

  • Do one update: update your django revision to the latest for your old version. If you have an 1.11.20, pick the latest 1.11.29.

  • Then enable deprecation warnings by running python with -Wa or by setting the PYTHONWARNINGS=always environment variable. Normally, you don’t want those warnings, but in this case they give you handy instructions on what to fix before you do the python or django update. The alternative is to “just” upgrade and to have a non-starting site due to import errors and so: then you have to figure it out yourself. Why not be lazy and use the info that python/django wants to give you?

Now you’ve got a good, clean local representation of what’s on the server, ready for further upgrades.

  • Second phase: upgrade your linux. Ubuntu “xenial” to “bionic”, for instance.

  • This automatically means a newer python version. Check that the site still runs/builds.

  • Probably you need to upgrade one or more libraries to make sure it works with the newer python version.

  • Again: fix the deprecation warnings.

Such a ubuntu/python upgrade normally doesn’t result in many changes. The next phase does!

  • Third phase: upgrade django. In the talk I said you could move in one go from one django LTS (long term support) to the next LTS, provided you previously fixed all deprecation warnings. But… when looking at my latest upgrade project, I moved from 2.2 => 3.0 => 3.1 => 3.2. So… disregard what I said during the meeting and just do what’s in the django documentation :-)

  • You probably need to unpin and upgrade all your dependencies now. Dependencies normally don’t support many different django versions, so if your site is a bit older, these upgrades will be necessary.

  • Fix deprecation warnings again to get your project in a neat state.

  • Check that everything works, of course. This includes running the tests, also of course.

If you do your upgrade project in these three phases, each individual phase will be quite doable. The first phase often is the hardest if the project is already quite old.


Quick personal note: one day after the meetup, a (Dutch) come-work-at-my-company video was ready. I really like to show it here :-)

PyGrunn keynote: learn pattern matching by writing a game - Łukasz Langa

2021-11-12

Tags: pygrunn, python

(One of my summaries of a talk at the 2021 10th Dutch PyGrunn one-day python conference).

Note: Łukasz Langa is the author of the wonderful black code formatter.

Note 2: I made a summary of a different pattern matching talk last month.

Łukasz started making a small game to learn about python 3.10’s new pattern matching functionality. Actually programming something that you want to finish helps you to really delve into new functionality. You won’t cut corners.

One of the things he automated in the past was a system to manage his notes, for instance to export notes marked “public” to his weblog. His notes are all in git. Lots of notes. An advice unrelated to the rest of the talk:

  • Keep notes.

  • Own your data.

  • Automate with python.

He showed the source code for his simple game. One of the methods was 15 lines of an if/elif with some more nested if/else statements. if isinstance(...) and so. He then showed the same code with the new pattern matching of python 3.10. Matching on types, matching on attribute values.

match and case may seem very weird now in the way they are implemented. But he thinks they can become pretty useful. You won’t use them a lot, normally. But in some cases it’ll make your code more neat and clear. It seems useful.

PyGrunn keynote: make it work. fast. - Alexander Solovyov

2021-11-12

Tags: pygrunn, python, django

(One of my summaries of a talk at the 2021 10th Dutch PyGrunn one-day python conference).

He is the CTO of a big Ukraine fashion marketplace. 10-20k orders per day. So the talk is about them surviving load spikes and so.

In 2016 they had a clojure/clojurescript/react single page app. They saw 30% more requests per second, which caused 3x the processor load. Bad news… One of the things he used was clojure.cache and picked the fast memory cache option. After finally reading the documentation, he discovered it was the cause of their problem. A cache call would fail, which would end up in a retry loop which would in effect cause almost an infite loop. Oh, and his son was only two weeks old and he was sleep-deprived. He managed to replace clojure.cache by memcached, which solved the problem.

Halloween 2017. Wife in hospital. They started losing TCP packets… The main.js was barely loading which is bad in a single page web application :-) The processor load on the load balancers just kept increasing. One of the problems was the marketing department that recently added a fourth level to the menu structure of the website. Which resulted in a 3MB json file with the full menu. To compensate a bit, they increased the gzip level to “9” which made it a little bit smaller. But that also meant a huge increase in the load on the (bad) load balancer that had to do the compressing. Putting it back at “5” solved the issue…

A regular non-busy day in july. Son is in hospital after a vaccine shot. He was also in the hospital. What can go wrong? Well, the site can go down in the night due to a DDOS attack. The solved it by doing a quick if/else on the attacker’s user agent string in the UI code…

2018, they did a pre-shopping-season load test. It turned out their database was hit quite hard. So they used pg_stat_statements to check all their queries. The table with products was the one being hit hard. Which was strange, because they cached it really well. Only… the cache wasn’t working. They missed a key in their cache setting…

Black friday 2018 went without a hitch.

16 november 2020. Black friday just around the corner. But after a new release, the app suddenly starts eating CPU like crazy. Deploying the old release helped. But… the changes between the old and new version didn’t look suspicious. What to do? The took a profiler and started looking at the performance. It turned out some date parsing function was mostly to blame. Parsing dates? Yes, they just started a marketing promotion with a credit card provider with an offer limited to a specific date. So they added the date to the config file. And their was some tooltip showing the date. And there was some library they used that tried some 20 date formats every time… The solution? Parse the config’ed date once upon application startup…

Later they had a problem talking to the database. JVM problem? Can it be the network? Postgres driver problem? PGbouncer? Postgres itself? No. Everything seemed to be working well, only it didn’t work. 20 hours later they stopped everything and started manually executiong SQL select statements in desperation. …. and many of them stayed stuck without an error message??? In the end it was one corrupted file in postgres. So even the super-reliable postgres isn’t always perfect.

 
vanrees.org logo

About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.

Weblog feeds

Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):