Djangocon: distributed systems - Raphael Barrois¶

(One of the summaries of a talk at 2014 djangocon.eu)

Raphael Barrois has this definition for distributed systems: they are a set of autonomous, interconnected services. Possibly a different codebase. Each service with its own database.

Why? When?¶

For elaborate systems you can make a nice tidy elaborate architecture. But after two years of bugfixing and feature-adding and changes it inevitably becomes a huge sprawling mess.

So… split it up! Into separate components.

When would you start doing something distributed-like? When a new feature has to be added that’s not an incremental improvement. A feature that is strongly disconnected from the core features.

Another reason can be that you install a new version of your main project somewhere, possibly in another datasenter or with a separate database. It still has to be operated by the same ops team, though.

Or there could be technical reasons like scaling/sharding or geographic expansion or debundling/upgrading.

Building it¶

Where should you start? A good point is to extract generic code. misc, util, tools. Everything that’s not business-specific. Perhaps also some basic business components: common models, UI, etc.

Put this kind of code into separate models and adapt your deployment process for fast-moving internal dependencies as you’ll want to be able to adjust those common components often.

Provide an integrated UI/UX, for instance. Extract your common UI templates, fix up something with inter-project headers and normalize your assets pipeline (with bower/grunt for instance).

For the UI/UX, an efficient SSO (single sign on) solution is very handy. SSO isn’t the same as “centralized authentication”, though. It is just log-in-once, for all your services. OpenID-like for web customers and, if you can get it working, kerberos for your company-internal authentication.

About kerberos:

It is efficient, secure authentication. Works with active directory and most of the others. OSX, windows, linux: everything works. You authenticate once and you’re automatically logged in for the websites that support it. Firefox, chrome, IE, apache, nginx: all supported.
You need to explicitly mark the sites as trusted in a configuration file. The fallback is an ugly http authentication box. In Django terms: you get a REMOTE_USER header that you can trust.

A few rules for splitting:

You need a single source of truth. Otherwise you’ll go mad when debugging.
Split per business domain, not per object/table.
Think about redundancy/availability.

A mistake to avoid: don’t rewrite the whole module before splitting. Just enforce the target API within the existing source code, simply running still on the same host. Then start the remote module, running exactly the same code. Then add a dispatcher that can call both the local and the remote modules. Once everything is working well, switch to “full remote” mode. As the last step, remove the dispatcher.

RPC (remote procedure calls)¶

REST APIs are standard issue nowadays. Raphael advices to steer clear of them at the start. You know exactly which methods you need, so RPC is fine. You’ll avoid lots of performance issues. And… it is easier to port from internal function calls.

When choosing an RPC technology, you have to watch for authentication/security, debugging/logging and for possibilities to change the schema. Are they simple to do?

Regarding security: don’t trust the frontend. Ask to push a “proof of identity” first.
Changing the schema? You will go through several iterations of the API. Avoid complex migrations by first deploying clients that support the old and new API, then a server with both old and new, then switch the client to the new and as last step switch all the servers to solely the new API.

Allow optional fields with defaults. And ignore unkown fields.

RPC: which technology to use? Transport, serialization, application layer, those are the three layers.

Transport. HTTP good at load balancing and debugging, bad for binary data. AMQP: load balance, async and binary are good, time-critical request/response is bad. ZeroMQ is good for async, binary and performance, but you’ll need to build your own load-balancing. You could use plain TCP or UDP, but that means building all your own stuff.
Serialization. You’ll need complex, nested objects. Efficient packing. Schema validation and updates if possible. Optional: custom/specialized datatypes.

msgpack, protobuf are possible. JSON is bad at packing and there’s no schema for custom types. XML has a very strict schema and advanced types but bad packing. Python pickles: NO, too tied to python.
Application layer. Authentication, explicit schema and unambiguous serialization are needed. Ideally logging/debugging, simple API calls and simple server-side setup.

JSON-RPC is well-defined, hasn’t got auth, it has an implicit schema. XML-RPC basically the scheme. SOAP/XML is an option. JSON/REST too.

Recommended: start with a simple stack. http transport, json serialization and a json-rpc/REST like protocol. For more advanced use, look at zeroMQ/AMQP transport, protobuf serialization, [missing] protocol. Yeah, that “[missing]” is a problem.

Keeping it running¶

Reliability/redundancy is key. Ensure that the role of each service is clear. Is it the only source for some data? Must it be queried every time? Ask those kinds of questions.

Upgrades are a challenge. If it is a critical service, try getting a progressive upgrade to work. And play with load-balancers. Seldom used backends: it is up to you.

For big changes, increase the API version number and use a new code path. Make sure you monitor the old service to spot non-upgraded clients!

Release often, if only to upgrade dependencies. Keep those up to date! This avoids using obsolete code and it helps with future feature-based upgrades. And… the less versions of libraries you have in production, the easier it is to track and fix bugs.

(Raphael had some good comments on doing testing, but I didn’t manage to get that written down. I hope he posts his slides online.) Update: his slide are now available, thanks!