Reinout van Rees’ weblog

PyGrunn: ‘all you need is less’, rethinking big data - Berco Beute

2017-05-19

Tags: pygrunn

(One of my summaries of a talk at the 2017 PyGrunn conference).

He has been doing some thinking lately, so he wants to talk at a higher level. The systems we’re building are getting ever bigger and ever more important. If something goes down, real problems often occur.

Similarly, it is important to talk about ethical thinking in software. What do you build? Is it OK to build it?

Perhaps more important than the actual software: the data. There’s a difference between data and information. Information is data that’s been processed/interpreted/organized/categorized so that you can do more with it.

Data by itself is not meaningful. Once you add context, it starts to come to life. A context can be your application. In a way, “big data” is thus meaningless. Meaningless without (your application’s) context.

He showed a quote by Frank Zappa:

Information is not knowledge.
Knowledge is not wisdom.
Wisdom is not truth.
Truth is not beauty.
Beauty is not love.
Love is not music.
Music is the best...

The amount of available data is huge. It grows and grows and grows every year. A “Cambrian explosion”. Much of this data is generated by us. Should we generate it? Should we keep it around?

The “internet of things” will only make the growth bigger.

Data is scattered across organisations. Offline and online. Look at the amount of data that your government has about you. The Dutch government is pretty OK, but quite a number of other governments... do you want them to have lots of data about you?

Information is also scattered across (your) devices.

The question: is that a problem?

  • There’s a cost component. Storing data costs money. It is expensive to collect. The less you collect, the less you have to pay.

  • It is vulnerable! Data is a liability: you’re responsible for the data you collect and store. So you have to take lots of care! Data is a valuable good: for others as well. It is worth having and it is worth stealing.

    See also ransomware.

    Data can also be manipulated. You can be misguided. Or stolen personal data can be used for fraud or stealing stuff.

  • It impairs your vision. Do you still see the trees for the forest?

  • Shifting context blurs the meaning of data. Data collected for one purpose might not be correct in another.

What is the cause?

  • Ehm, why do we copy data so much? It looks like hoarding.

    One reason is the computer itself. Originally, they weren’t connected. So you had to copy data to get it from one to the next computer. And originally they had very little memory, so you had to copy data in and out of the computer. Now everything is connected and we have huge amounts of memory.

    We, as an industry, also have an almost built-in need to work with the original data and to do all the calculations ourselves. Isn’t the actual information we want enough?

Is there another way?

  • Stop copying! Do you really need the data? Locally? Given enough bandwidth and low enough latency, copies are not needed. (See the George Gilder book “Telecosm: the world after bandwidth abundance”.)

    Is it so strange? No. Who still has CDs? Who has Spotify?

  • Information minimalization. There should be a strict rule “don’t collect more information than you need”.

    Design information systems to require as little data as possible.

  • Start with the process. Process-based design instead of data-centric design. When starting with the data, you seem to always need more and more. If you start with the process, you look at what you really need.

  • Contract-based systems. He thinks we’ll slowly go in that direction. Interactions will be much more negotiation-like. You’ll ask for the information you need.

    An example. An online store won’t ask you for your address directly. It will ask you for permission to request the information from the government. There’ll be a “contract” between the store and the government that deals with the use the store is allowed to make of the address information.

    A bartender only needs to know whether you’re 18+ or not, he doesn’t need to have all the information that’s in your passport that you need to currently show him to prove your age.

Conclusion

  • Stop copying.
  • Always ask yourself: do I really need all that data?
  • Information minimization.
  • Stick to the process.
  • Try to think in contracts.

See also the related “Web11” conference: http://web11.org/

https://abload.de/img/screenshot2017-05-08ak8y74.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

PyGrunn: google machine learning APIs for python developers - keynote from Google Cloud

2017-05-19

Tags: pygrunn, python

(One of my summaries of a talk at the 2017 PyGrunn conference).

Lee Boonstra and Dmitriy Novakovskiy gave the keynote.

Python at Google. Python is widely used at Google; it is one of its official languages. It is integral to many Google projects, for instance YouTube and App Engine. And there are lots of open source libraries: every API has its own Python client.

Google for Python developers. What can you do as a Python programmer on Google? The Google Cloud Platform: it consists of many parts and services that you can use.

  • You can embed machine learning services like TensorFlow, the Speech API, the Translation API, etc.
  • Serverless data processing and analytics: Pub/Sub, BigQuery (map/reduce without needing to run your own Hadoop cluster), etc.
  • Server stuff like Kubernetes and container services.

Machine learning. There have been quite a few presentations on this already. Look at it like this: how do you teach things to your kids? You show them! “That is a car”, “that is a bike”. After a while they will start learning the words for the concepts.

Machine learning is not the same as AI (artificial intelligence). AI is the process of building smarter computers. Machine learning is getting the computer to actually learn. Which is actually much easier.

Why is machine learning so popular now? Well:

  • The amount of data. There is much more data. So we finally have the data we need to do something with it.
  • Better models. The models have gotten much better.
  • More computing power. Parallelization and so on. You now have the power to actually do it in reasonable time.

Why google? Tensorflow is very popular (see an earlier talk about tensorflow).

You can do your own thing and use TensorFlow and the cloud machine learning engine. Or you can use one of the Google-trained services like the Vision API (object recognition, text recognition, facial sentiment, logo detection), the Speech API/Natural Language API (syntax analysis, sentiment analysis, entity recognition) and the Translation API (realtime subtitles, language detection). Beta feature: the Video Intelligence API (it can detect the dogs in your video and tell you when in the video the dogs appeared...).

Code and demos. She gave a nice demo about what she could recognize with Google’s APIs in an Arjen Robben image. It even detected the copyright statement text at the bottom of the photo and the text on his lanyard (“champions league final”). And it was 88% sure it was a soccer player. And 76% sure it might be a tennis player :-)

Using the API looked pretty easy. Nice detail: several textual items that came back from the API were then fed to the automatic translation API to convert them to Dutch.
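
To give an idea what calling the Vision API looks like from Python, here is a rough sketch using the google-cloud-vision client library (not her demo code; exact class paths differ a bit between library versions and the file name is a placeholder):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("robben.jpg", "rb") as image_file:
        content = image_file.read()

    # In older library versions this is vision.types.Image instead.
    response = client.label_detection(image=vision.Image(content=content))
    for label in response.label_annotations:
        print(label.description, label.score)  # e.g. "soccer player", 0.88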

Tensorflow demo. He used the MNIST dataset, a set of handwritten numbers often used for testing neural nets.
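
Not his actual demo code, but a minimal softmax classifier on MNIST in the style of the classic TensorFlow 1.x tutorial looks roughly like this:

    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

    x = tf.placeholder(tf.float32, [None, 784])      # flattened 28x28 images
    y_true = tf.placeholder(tf.float32, [None, 10])  # one-hot digit labels

    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(x, W) + b

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(1000):
            batch_x, batch_y = mnist.train.next_batch(100)
            sess.run(train_step, feed_dict={x: batch_x, y_true: batch_y})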

Dataflow is a unified programming model for batch or stream data processing. You can use it for map/reduce-like operations and “embarrassingly parallel” workloads. It is open sourced as Apache Beam (you can use it hosted on Google’s infrastructure).
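
As a rough idea of what such a pipeline looks like in Python (a sketch with placeholder file names, not the demo code):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (pipeline
         | "read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
         | "split" >> beam.FlatMap(lambda line: line.split())
         | "pair" >> beam.Map(lambda word: (word, 1))
         | "count" >> beam.CombinePerKey(sum)
         | "format" >> beam.Map(lambda kv: "%s: %d" % kv)
         | "write" >> beam.io.WriteToText("gs://my-bucket/output"))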

The flow has four steps:

  • Cloud Storage (storage of everything).
  • Dataflow.
  • BigQuery (data storage).
  • Data Studio (data visualization).

(The demo code can be found in the sheets that will be available, googling for it probably also helps).

https://abload.de/img/screenshot2017-05-08a1pzqd.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

PyGrunn: deep learning with tensorflow (“Trump tweet generator”) - Ede Meijer

2017-05-19

Tags: pygrunn, python

(One of my summaries of a talk at the 2017 PyGrunn conference).

He originally used a Java solution for ‘machine learning’, but that didn’t work very comfortably. He then switched to TensorFlow, which is used from Python.

Machine learning is learning from data without being explicitly programmed. You feed the computer lots of data and it learns from that. Some examples of the techniques used: linear regression, logistic regression, decision trees, artificial neural networks and much more.

Artificial neural networks are what TensorFlow is about. A normal neural network has an “input layer”, a “hidden layer” and an “output layer”. The nodes in the three layers are connected. The neural net tries to learn the “weights” of the connections.

Deep learning means you have neural networks with multiple hidden layers. Often it deals with features at different levels of abstraction. Images that have to be recognized can be cut into several pieces of different sizes and fed to the net as those parts, but also as the full image. Training a model often works by minimizing the error using the “gradient descent” method.

TensorFlow? What are tensors? Well, a 0D tensor is a scalar, a 1D tensor a vector, a 2D tensor a matrix, etc.
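
In TensorFlow terms that looks roughly like this (a quick sketch):

    import tensorflow as tf

    scalar = tf.constant(3.0)                        # 0D tensor
    vector = tf.constant([1.0, 2.0, 3.0])            # 1D tensor
    matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # 2D tensor
    print(scalar.shape, vector.shape, matrix.shape)  # (), (3,), (2, 2)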

Then he started showing code examples. The ultimate test was trying to teach the network to talk like Trump. So it got fed a huge number of Trump twitter messages :-) It worked by defining a number of “layers” in order to predict the next character in a tweet. So it is very low-level.

In the end the generated tweets started to look like Trump tweets. The code is here: https://github.com/EdeMeijer/trumpet

https://abload.de/img/screenshot2017-05-08ai9z30.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

PyGrunn: looking at molecules using python - Jonathan Barnoud

2017-05-19

Tags: pygrunn, python

(One of my summaries of a talk at the 2017 PyGrunn conference).

He does research on fat molecules and applies simulation to them. F = m * a (plus some much more elaborate formulas). With an elaborate simulation, he was able to explain some of the properties of fat (using a Jupyter notebook).

His (Python) workflow? First you need to prepare the simulation. He did have (or did build) a simulation engine. The preparation takes text files with the following info:

  • Topology.
  • Initial coordinates.
  • Simulation parameters.

Those text files are prepared and fed to the simulation engine. What comes out is a trajectory (a file with the position, direction and speed of every single molecule for all timesteps).

The next step is analysis. A problem here is that various simulation engines export different formats.... Similar problem with the input, btw...

Luckily we’ve got Python. And for Python there is a huge number of libraries, including “MDAnalysis” (http://www.mdanalysis.org/), a library that can read these trajectory files. The output: Python numpy arrays. Nice! This way you can use the entire Python scientific stack (numpy, scipy, etc.) with all its power.

Numpy? Made for matrices. So you can work with your entire data set. Or you can filter, mask or slice your data.
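
As a rough sketch of what that looks like with MDAnalysis (placeholder file names and atom selection, not his actual workflow):

    import MDAnalysis as mda

    universe = mda.Universe("topology.gro", "trajectory.xtc")
    lipids = universe.select_atoms("resname DPPC")  # hypothetical selection

    for timestep in universe.trajectory:
        positions = lipids.positions  # plain numpy array, ready for numpy/scipy
        print(timestep.time, positions.mean(axis=0))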

Thanks to these tools, he can experiment with the data in a comfortable way. And make plots.

But... movies with the 3D simulation are better! So: https://github.com/arose/nglview . His molecules are moving around. He can even place a graph nearby and bind the timeline in the graph and the 3D visualization together. It is all python!

A problem he had was his directory structure. Lots of directories with simulation config files with different settings. A mess. So: http://datreant.org/ , “persistent, pythonic trees for heterogeneous data”.

Summary:

  • Python is awesome.
  • Jupyter is awesome too.
  • The python science stack is awesome as well.
  • Each field develops awesome tools based on the above.

https://abload.de/img/screenshot2017-05-13a1jzx1.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

PyGrunn: django localization (‘l10n’) - Cees van Wieringen

2017-05-19

Tags: pygrunn, django

(One of my summaries of a talk at the 2017 PyGrunn conference).

What is L10N (=LocalizatioN)? You have multiple terms like that.

  • Translation (t9n): just translating text.
  • Localization: the adaptation of a product or content to a specific language or culture. This includes translation, but also date/time formats, currency, units of measurement and number formatting.
  • i18n (internationalization) is preparing your product for localization. Using unicode. Django translation toolkit.

Translation in Django. Enable it in your settings.py by setting USE_I18N = True. And use _() around your strings.

Where does the underscore function come from? It is a gettext function:

from django.utils.translation import gettext as _

There are multiple gettext variants (gettext_lazy, for instance), so he thinks it best to use the full function name instead of the customary underscore.

Sometimes you have terms that can be translated in multiple ways. Django has a gettext variant (pgettext) that allows adding a context. “Spring” can be a “season”, but it can also be a “mechanical part”.
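
A small sketch of both points (the full function names instead of the underscore, and a context), with made-up strings:

    from django.utils.translation import gettext_lazy, pgettext

    # Full function name instead of the customary underscore alias.
    title = gettext_lazy("Welcome")

    # Same source word, different translation depending on the context.
    season = pgettext("season", "Spring")
    part = pgettext("mechanical part", "Spring")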

For translating models, there are multiple Django apps that handle it and that store the various translations in the database. But there is none that tries to grab the translation from the existing gettext *.po translation files. So he built one himself (“TransField”).

Localization trick for lengths: use Distance from the contrib.gis package. You can store a length as “300m” and then ask for distance.km() and distance.yard(). They added similar classes for volumes and temperatures and areas and masses.
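
A quick sketch of the Distance part (using django.contrib.gis.measure; their own classes for volumes, temperatures and so on presumably work along the same lines):

    from django.contrib.gis.measure import D  # D is a shorthand alias for Distance

    length = D(m=300)
    print(length.km)  # 0.3
    print(length.yd)  # roughly 328 yards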

Then they built model fields like AreaField and MassField. For localization you can provide defaults (“I want my lengths in km and my weights in kg”). When outputting values, the correct localization and unit are applied automatically.

The code is at https://github.com/ceasaro/django-l10n-extensions

A recommended video (8 minutes) that shows all the problems you might encounter when translating: https://www.youtube.com/watch?v=0j74jcxSunY

https://abload.de/img/screenshot2017-05-08atja0y.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

PyGrunn: creating abstraction between consumer and datastore - Marco Vellinga

2017-05-19

Tags: pygrunn, django

(One of my summaries of a talk at the 2017 PyGrunn conference).

Marco Vellinga started his talk with train-related analogies and photos, which I of course liked. They have a big monolithic application. Like a big train: move one part (a locomotive) and the rest also moves. In an application you don’t want that: if you make a change in one part of the application, you don’t want to have to change lots of other parts. Decoupled is better.

With abstraction, for instance a user object with username, password and billing information, you can change the backend (splitting user and billing info, for instance) without needing to change anything in the frontend.

They created an abstraction layer containing

  • core logic
  • validation
  • error handling
  • data contracts

There isn’t really support for such a layer in Django, so you have to build it yourself. This was a lot of work and took a couple of tries to get it right.

They used Marshmallow for JSON serialization. For serializing queries (the equivalent of Django’s ‘Q’ objects) they built filterql.
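
A hypothetical Marshmallow schema, just to give an idea of the serialization/validation part (not their actual code):

    from marshmallow import Schema, fields

    class UserSchema(Schema):
        username = fields.Str(required=True)
        email = fields.Email(required=True)

    schema = UserSchema()
    # Depending on the marshmallow version, validation errors end up in
    # result.errors or are raised as a ValidationError.
    result = schema.load({"username": "reinout", "email": "not-an-email"})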

For (API) versioning they built their own function/class decorator to mark functions as “version 1” or “version 2”. They’re calling it versionary. It is kind of spooky, but it is fully tested and documented.

The main question: is it worth building? Well, it paid off instantly when they put it into production because they immediately found invalid data that was already in the system. Because the abstraction layer put all the validation in one place, all the data that previously slipped through was now caught :-)

https://abload.de/img/screenshot2017-05-08aiql1f.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

PyGrunn: docker do’s and don’ts - Maarten Brugman

2017-05-19

Tags: pygrunn

(One of my summaries of a talk at the 2017 PyGrunn conference).

They started using Docker with much enthusiasm, but some things went wrong.

Many people use the pet/cattle analogy. Servers used to be pets: lovingly cared for and raised. With Docker you can treat them as cattle: something wrong? Just grab a new one.

Of course they started out by treating their Docker containers as lovingly maintained pets. Manual changes and then re-saving the image...

  • Don’t install stuff inside a running container. Put it in the Dockerfile.
  • Don’t pull from version control into containers. Instead build a container that contains your app.
  • Don’t keep containers running for a long time without security updates. Build a fresh one regularly.
  • Don’t just expose ports as for instance 8000:8000, as docker exposes them on 0.0.0.0 by default, so also on all your server’s external interfaces. The basic “ufw” firewall won’t help, as Docker opens up those ports with iptables anyway. So explicitly bind them to 127.0.0.1 (for instance 127.0.0.1:8000:8000).

The most important “Do”:

  • Learn how to set up your containers as cattle!!! They are disposable. They should be created and disposed of without a second thought.

You can use containers as a light-weight VM. It is not the default docker approach, but you can use a docker that includes a database and logging. It can start up much quicker than lots of separate containers.

Containers can make it easier to move to cloud providers. He works for a company that needs to install lots of (geographical) dependencies onto their servers. That is sometimes a problem. With docker, you can make sure all dependencies are in your Docker and you can use a random docker-supporting cloud provider.

Docker supports the “open container initiative” (OCI) to make it more portable. There are docker alternatives.

Summary: containers are great. They do have a learning curve. Using them as cattle (instead of pets) is best. And... use docker-compose or another orchestration tool.

https://abload.de/img/screenshot2017-05-08ahqa6h.png

Photo explanation: just a nice unrelated picture from my work-in-progress German model railway

Dutch note: programming Python+Django in the heart of Utrecht by the old canal? Water sector, so lots of data and geo. Fun! Nelen&Schuurmans is hiring. Just send me an email, as the job posting isn’t online yet :-)

Suddenly inaccessible CIFS mounts on linux

2017-05-16

We suddenly saw some failing CIFS (smb) mounts on servers in our (zabbix) monitoring.

  • Hurray for monitoring and especially also custom monitoring scripts that caught this specific error.
  • Huh? Why is it suddenly failing?

The error when (re)mounting the share:

mount error(112): Host is down

Pinging the host of course works just fine. 10 seconds of googling led us to this serverfault.com answer which contained the solution: add vers=2.0 to your mount options.

The reason: if you’ve correctly (and recently) patched your windows servers, you’ve also disabled the SMB version 1 protocol. And somehow linux/smb doesn’t automatically detect version 2, so you have to set it explicitly.

There might be something that I’ve missed, but at least our CIFS mounts work again :-)

HTTPS behind your reverse proxy

2017-05-02

Tags: django, python

We have a setup that looks (simplified) like this:

https://abload.de/img/screenshot2017-05-02a69bku.png

HTTP/HTTPS connections from browsers (“the green cloud”) go to two reverse proxy servers on the outer border of our network. Almost everything is https.

Nginx then proxies the requests towards the actual webservers. Those webservers also have nginx on them, which proxies the request to the actual django site running on some port (8000, 5010, etc.).

Until recently, the https connection was only between the browser and the main proxies. Internally inside our own network, traffic was http-only. In a sense, that is OK as you’ve got security and a firewall and so on. But... actually it is not OK. At least, not OK enough.

You cannot trust in only a solid outer wall. You need defense in depth: network segmentation, restricted access. So ideally the traffic from the main proxies (in the outer “wall”) to the webservers inside it should also be encrypted, for instance. Now, how to do this?

It turned out to be pretty easy, but figuring it out took some time. Likewise finding the right terminology to google with :-)

  • The main proxies (nginx) terminate the https connection. Most of the ssl certificates that we use are wildcard certificates. For example:

    server {
      listen 443;
      server_name sitename.example.org;
      location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-Proto https;
        proxy_redirect off;
        proxy_pass http://internal-server-name;
        proxy_http_version 1.1;
      }
      ssl on;
      ....
      ssl_certificate /etc/ssl/certs/wildcard.example.org.pem;
      ssl_certificate_key /etc/ssl/private/wildcard.example.org.key;
    }
    
  • Using https instead of http towards the internal webserver is easy. Just use https instead of http :-) Change the proxy_pass line:

    proxy_pass https://internal-server-name;
    

    The google term here is re-encrypting, btw.

  • The internal webserver has to allow an https connection. This is where we initially made it too hard for ourselves. We copied the relevant wildcard certificate to the webserver and changed the site to use the certificate and to listen on 443, basically just like on the main proxy.

    A big drawback is that you need to copy the certificate all over the place. Not very secure. Not a good idea. And we generate/deploy the nginx config for the webserver from within our django project. So every django project would need to know the filesystem location and name of those certificates... Bah.

  • “What about not being so strict on the proxy? Cannot we tell nginx to omit a strict check on the certificate?” After a while I found the proxy_ssl_verify nginx setting. Bingo.

    Only, you need 1.7.0 for it. The main proxies are still on ubuntu 14.04, which has an older nginx. But wait: the default is “off”. Which means that nginx doesn’t bother checking certificates when proxying! A bit of experimenting showed that nginx really didn’t mind which certificate was used on the webserver! Nice.

  • So any certificate is fine, really. I did my experimenting with ubuntu’s default “snakeoil” self-signed certificate (/etc/ssl/certs/ssl-cert-snakeoil.pem). Install the ssl-cert package if it isn’t there.

    On the webserver, the config thus looks like this:

    server {
        listen 443;
        # ^^^ Yes, we're running on https internally, too.
        server_name sitename.example.org;
        ssl on;
        ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
        ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
    
        ...
    }
    

    An advantage: the django site’s setup doesn’t need to know about specific certificate names, it can just use the basic certificate that’s always there on ubuntu.

  • Now what about that “snakeoil” certificate? Isn’t it some dummy certificate that is the same on every ubuntu install? If it is always the same certificate, you can still sniff and decrypt the internal https traffic almost as easily as plain http traffic...

    No it isn’t. I verified it by uninstalling/purging the ssl-cert package and then re-installing it: the certificate changes. The snakeoil certificate is generated fresh when installing the package. So every server has its own self-signed certificate.

    You can generate a fresh certificate easily, for instance when you copied a server from an existing virtual machine template:

    $ sudo make-ssl-cert generate-default-snakeoil --force-overwrite
    

    As long as the only goal is to encrypt the https traffic between the main proxy and an internal webserver, the certificate is of course fine.

Summary: nginx doesn’t check the certificate when proxying. So terminating the ssl connection on a main nginx proxy and then re-encrypting it (https) to backend webservers which use the simple default snakeoil certificate is a simple workable solution. And a solution that is a big improvement over plain http traffic!

Fossgis: vandalism in openstreetmap - Pascal Neis

2017-03-24

Tags: fossgis

(One of my summaries of a talk at the 2017 fossgis conference).

The number of daily users has almost doubled in recent years.

Vandalism? Deleting an entire city block? You must distinguish between accidental changes and real vandalism. It is easy to try out an editor and change something in the real database.

  • Vandalism. Deleting items.
  • Bad editing behaviour.
    • Mass edits. Imports and bots.
    • “Edit wars”.
    • Missing knowledge. Trying things out. Adding material with the wrong copyright. Removing a railroad because it looks better on their map and not knowing that they’re editing the real map.

An example: Pokémon. People thought that the Pokémon app uses OpenStreetMap data to locate footpaths and parks. So suddenly people started adding parks and footpaths around their house. Luckily they often added “park for pokemon” in the comments.

For a month, he invested 15-45 minutes a day trying to detect vandalism, for instance by reviewing the edits in lists of new users. The best solution if you see something that’s not right: ask the editor in a friendly way. Stay relaxed. As a second step, notify local mappers. As a last-ditch measure: revert the edit.

What he saw:

  • 20% real vandalism.
  • 24% had already been reverted (but not documented and such).
  • 23% was test data.
  • 13% fictive data.

Regarding the answers he got back: out of 300 comments, he got only 70 answers. 20 of those were from the original author of the changeset. All those 20 answers were friendly! He did have to wait about 24 hours on average before getting an answer. So don’t expect an email right away.

He showed a lot of nice statistics. For those, you’ll probably have to look at his blog. Start at http://neis-one.org/2017/01/reviewing-osm-contribution-1/, you’ll find links to the tools he used there. For instance http://resultmaps.neis-one.org/osm-suspicious.

He has some ideas for future tools. Push instead of pull notifications. Coordination of reviews. etc.

 