Reinout van Rees’ weblog

Large scale search for social sciences - Wouter van Atteveldt

2017-11-22

Tags: python, django

(Summary of a talk at a Dutch Python meetup). Full title: large scale search and text analysis with Python, Elastic, Celery and a bit of R.

Wouter teaches political sciences at the university and uses a lot of text analysis. He showed an example of text analysis of the Gaza-Israel conflict, comparing USA media talking about Israel and Chinese media talking about Israel. You saw big differences! The USA media talks more about rocket attacks by Hamas on Israel. Chinese media talks more about the results on the ground: invasion, deaths, etc.

Text analysis is very important for social sciences. There’s a flood of digital information, online and archived. Major newspapers in the Netherlands have been digitized by the royal library, for instance. Lots of text to work with. You see the same with facebook and the like: they can extract lots of info from the texts people type in!

Facebook once did an experiment on positive/negative tweaks to timelines. Totally bullshit from a scientific viewpoint. So we cannot leave social research to the likes of facebook. So... there is a need for a good tool, especially for education. They built it in python.

Why python? Open source, platform independent, relatively easy to learn. Large community. Lots of tools like django and numpy, each with a large community around it.

He also uses R. Also open source. Mostly a question of “go where the users are” as most social scientists are used to statistical languages like R.

What they built is https://amcat.nl, Amsterdam Content Analysis Toolkit. He demoed it with textual searches in old newspaper articles. Nice graphs, for instance, with occurrence of words in the various years.

AmCAT is written in python and django. Fully open source. Postgres, celery. The search happens with elasticsearch. The articles are all in elasticsearch, but for safety’s sake they’re also keeping them in postgres. Perhaps not needed, but...

They use django management commands for a reasonably friendly command line interface. Both for maintenance commands and for queries.

Queries can be quite elaborate and take quite some time. Longer than the regular web timeout. For that, they use Celery for offloading long-running tasks.
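
A minimal sketch of that pattern (the module and task names here are made up, not AmCAT’s actual code): a Celery task that does the heavy lifting, plus a thin management command that queues it.

    # tasks.py -- the long-running work, offloaded to a Celery worker
    from celery import shared_task

    @shared_task
    def run_query(query_string):
        # imagine this hits elasticsearch and takes minutes
        return {"query": query_string, "hits": 12345}

    # management/commands/run_query.py -- the friendly command line wrapper
    from django.core.management.base import BaseCommand
    from myapp.tasks import run_query

    class Command(BaseCommand):
        help = "Queue a (possibly long-running) query on a Celery worker."

        def add_arguments(self, parser):
            parser.add_argument("query")

        def handle(self, *args, **options):
            result = run_query.delay(options["query"])
            self.stdout.write("Queued task %s" % result.id)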

Text analysis: for that you need NLP, Natural Language Processing. There are many different good tools like CoreNLP for English, Alpino for Dutch, ParZU for German. Only.... they’re hard to use for social scientists: hard to install and operate.

What they built themselves is NLPipe, a simple (Flask) website for interacting with the NLP libraries, working as a job manager. You get the results out as csv or json.
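
To illustrate the job-manager idea, a hypothetical sketch (this is not NLPipe’s real API; the endpoints are made up): submit a text, get a job id back, fetch the result as json later.

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    JOBS = {}  # a real setup would use persistent storage

    @app.route("/jobs", methods=["POST"])
    def submit_job():
        text = request.get_json()["text"]
        job_id = str(len(JOBS) + 1)
        JOBS[job_id] = {"status": "queued", "text": text}
        # a worker would pick this up and run e.g. Alpino or CoreNLP on it
        return jsonify({"id": job_id}), 202

    @app.route("/jobs/<job_id>")
    def job_result(job_id):
        # once processed, the result could be returned as json (or csv)
        return jsonify(JOBS[job_id])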


Unrelated advertisement: want to program Python with us in the heart of Utrecht? Just send me an email :-)

Building robust commandline tools with click and flask - Wojtek Burakiewicz

2017-11-22

Tags: python, django

(Summary of a talk at a Dutch Python meetup).

Wojtek likes simple tools. His presentation is about a recent tool they built: an internal tool that deploys python applications to servers with virtualenvs. It also configures nginx.

They based the architecture around microservices. So you basically write a simple script that talks to a couple of APIs that in turn talk to other APIs.

They started out with stash, artifactory, ansible, jenkins, supervisord and JIRA. All except JIRA have a nice API. Their deploy tool talked to those APIs. One problem was authentication. One tool needs user/pass, the other an auth token, the next an ssh key... Another problem was network segmentation. You don’t want every laptop to talk to your production environment...

The solution was to use one intermediary API. So the command line tool talks to the intermediary API which in turns talks to the APIs mentioned above.

Another advantage of an intermediary API is that you can unify the concepts. You can just talk about “host” and “application”, even though in Jenkins/ansible/etc it might be a “job” or a “machine”.

You can also exchange components! You can switch from Stash to Bitbucket without the user of the deploy tool noticing.

They used flask. If you compare it to django:

  • When you install django, you get a house.
  • When you install Flask, you get a single brick.

There are all sorts of libraries you can use to get all the functionality you would get with django. Only you get to install and compose it yourself.

For the command line, he used click, a nice alternative to argparse (he dislikes argparse). From the homepage: Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary.
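
A small example of what such a click-based tool could look like (the command, options and API URL are illustrative, not the actual tool):

    import click
    import requests

    API = "https://deploy-api.example.com"  # the intermediary API

    @click.group()
    def cli():
        """Deploy python applications via the intermediary API."""

    @cli.command()
    @click.argument("application")
    @click.option("--host", help="Target host to deploy to.")
    def deploy(application, host):
        """Deploy APPLICATION to HOST."""
        response = requests.post(API + "/deploy",
                                 json={"application": application, "host": host})
        click.echo(response.json()["status"])

    if __name__ == "__main__":
        cli()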

When you distribute command line tools, people will have older versions. So you need to get the users to upgrade in some way.

They did it in three ways:

  • By returning a special header from the API with the desired command line version. The command line tool can then ask the user to upgrade when needed (see the sketch after this list).
  • The command line tool also passes its version to the API. If the API needs a newer version, it can return an error code.
  • If an update was needed, it would also print out the necessary pip command as an extra service.
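
A sketch of the header-based check from the first bullet (the header names and versions are invented for illustration):

    import requests

    CLIENT_VERSION = "1.4.0"

    response = requests.get("https://deploy-api.example.com/status",
                            headers={"X-Client-Version": CLIENT_VERSION})
    wanted = response.headers.get("X-Wanted-Client-Version", CLIENT_VERSION)
    if wanted != CLIENT_VERSION:
        print("Please upgrade: pip install --upgrade deploy-tool==%s" % wanted)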

Unrelated advertisement: want to program Python with us in the heart of Utrecht? Just send me an email :-)

PyCon.de: friday lightning talks

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

Parallel numpy with Bohrium - Dion Häfner

He had to port a fortran codebase to numpy. Took a few months, but was quite doable. Just some number crunching, so you can do everything with numpy just fine.

For production use it had to run on parallel hardware. For that he used bohrium, a tool that works just like numpy, but with jit-compiled code.
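
The drop-in idea looks roughly like this (a sketch, assuming bohrium’s numpy-compatible interface):

    import bohrium as bh  # numpy-compatible arrays, lazily fused and jit-compiled

    a = bh.ones((1000, 1000))
    b = bh.ones((1000, 1000))
    c = a + b * 2   # the array expression gets compiled to parallel code
    print(c.sum())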

He showed some numbers: a lot faster.

Cultural data processing with python - Oliver Götze

Cultural data? Catalogs of book archives. Lots of different formats, often proprietary and/or unspecified and with missing data. And with lots of different fields.

He wrote a “data preparation tool” so that they can clean up and transform the data to some generic format at the source.

The power of git - Peer Wagner

What do you think your repositories contain? Code? More!

He read a book about “data forensics”.

git log is ok. But you can pass it arguments so that you can get much more info out of it. You can show which parts of your code are most often edited.
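
For example, counting which files change most often only takes a few lines of python around git log (a quick sketch, not his tool):

    import subprocess
    from collections import Counter

    # one line per touched file, over the whole history
    log = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout

    counts = Counter(line for line in log.splitlines() if line)
    for path, n in counts.most_common(10):
        print(n, path)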

You can also look which files are often committed together and make a relationship graph of it.

He’s started a tool for this: https://github.com/wagnerpeer/gitexplorer, which takes inspiration from Adam Tornhill’s codemaat.

(Update: corrected the project name and attribution)

A travel back in time: xmahjongg - Klaus

http://lcdf.org/xmahjongg/ . The first free version for linux. Also ported to osx.

But... it was only compiled for powerpc processors. And he wasn’t able to re-compile it.

So he rebuilt it in python: https://github.com/kbr/pyxmahjongg

pip install pyxmahjongg :-)

Test-driven check_mk checks - Thomas Reifenberger

Nagios is a monitoring tool. You have to create all the config by yourself. Or generate it.

You can use check_mk. It uses nagios as a core and generates the config for you.

If you want custom checks, you need both a plugin on the server and a small scriptlet on the target.

You don’t want to test it in production. You want to test it properly. So he built pytest-check-mk to do that.

https://pypi.python.org/pypi/pytest-check-mk/

This way:

  • you’re not testing in production.
  • the feedback cycle is much faster.

Learn to build the internet - Maximilian Wilhelm

If you’re afraid of networking and network setups: last year there were two days of introduction to the Linux networking stack. It is in German. They videoed everything. And the slides are available.

It is all here: https://blog.sdn.clinic/2017/09/ffrl-routingdays-learn-to-build-the-internet/

https://abload.de/img/screenshot2017-10-13alkup2.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de: graphql in the python world - Nafiul Islam

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

graphql is a query language for your API. You don’t call the regular REST API and get the standard responses back, but you ask for exactly what you need. You only get the attributes you need.

Graphiql is a graphical explorer for graphql. Github is actually using graphql for its v4 API. He did a demo.

The real question to ask: why graphql over REST?

  • There is a standard. No more fights over the right way to do REST.
  • Development environment (graphiql).
  • You get only what you want/need.
  • Types.

Lots of companies are using it already.

What does python have to offer? graphene. Graphene uses these concepts:

  • Types/objects. More or less serializers.
  • Schema. Collection of objects and mutations. “Your API”.
  • Resolver.
  • Query. What you can ask of the API. “You can search for users by username and by email”.
  • Mutations. Changes you allow to be made. “You can create a new user and you have to pass a username and email”.

He demoed it. It looked really comfortable and slick.
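
A minimal graphene example to give an idea of those concepts (a hello-world sketch, not his demo code):

    import graphene

    class Query(graphene.ObjectType):
        # the query: you can ask for "hello", optionally with a name argument
        hello = graphene.String(name=graphene.String(default_value="world"))

        # the resolver: how the value actually gets produced
        def resolve_hello(self, info, name):
            return "Hello " + name

    # the schema: the collection of objects, "your API"
    schema = graphene.Schema(query=Query)

    result = schema.execute('{ hello(name: "PyCon") }')
    print(result.data)  # {'hello': 'Hello PyCon'}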

Some small things:

  • 2.0 is out (today!).
  • The django integration is better than the sqlalchemy integration at the moment.
  • There’s middleware for authentication.
  • Join in!
https://abload.de/img/screenshot2017-10-12a0tubc.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de: the snake in the tar pit, complex systems with python - Stephan Erb

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

He started with an xkcd comic:

https://imgs.xkcd.com/comics/python.png

Often it feels this way at the beginning of a project, later on it gets harder. You cannot just run import greenfield to get back to a green field again.

  • Most engineering time is spent debugging.
  • Most debugging time is spent looking for information.
  • Most time spent looking for information is because the code/system is unfamiliar.

Unfamiliar, unknown code: now we’re talking about team size. You’re probably debugging someone else’s code. Or someone else is debugging your code.

What can we do to understand such code? How can we spread the knowledge?

  • You can do informal reasoning. Try to “run the code in your head”. Code reviews. Pair programming.

    By getting better here, we create fewer bugs.

  • Try to understand it by testing. Treat it as a black box. See what comes out. Add integration tests. Do load tests. Perhaps even chaos engineering.

    By getting better here we find more bugs.

The first is better than the second way, right?

Both get harder when it becomes more complex. Complexity destroys understanding. But I need understanding to have confidence.

Keep in mind the law of leaky abstractions. All non-trivial abstractions are leaky. When something goes wrong, the actual implementation underneath “leaks through”.

Abstractions help us save time building something. They don’t save us time learning it afterwards. For python: watch out with dependencies. There are so many nice libraries we could use...

He showed some examples of how to make code more robust. One of the things he used was the pybreaker library, that one looks interesting.
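
For instance, pybreaker wraps calls to a flaky dependency in a circuit breaker (a small usage sketch; the wrapped function is made up):

    import pybreaker
    import requests

    # after 3 consecutive failures the breaker opens and calls fail fast
    # for 30 seconds, instead of hammering the broken dependency
    breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

    @breaker
    def fetch_profile(user_id):
        response = requests.get("https://example.com/api/users/%s" % user_id,
                                timeout=2)
        response.raise_for_status()
        return response.json()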

State and mutability: if you have state and want to move from one to the other, there are lots of possibilities and lots of ways in which it can go wrong. If you “just” wipe the state and rebuild it to the new state you want to have, there’s less complexity.

If you pass along a dict, that dict might get modified. If you pass a named tuple, it cannot be modified. So that is less complex.
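
A tiny illustration of that difference:

    from collections import namedtuple

    config = {"host": "localhost", "port": 8080}
    config["port"] = 9999   # anyone you pass this dict to can change it

    Config = namedtuple("Config", ["host", "port"])
    frozen = Config(host="localhost", port=8080)
    try:
        frozen.port = 9999  # not possible: namedtuples are immutable
    except AttributeError:
        print("cannot modify a namedtuple")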

Keep in mind: complexity breeds complexity.

End-to-end is a networking concept. Try to keep functionality as close to the application as possible. That way, everything in between doesn’t have to deal with the extra functionality/complexity.

He suggested the paper end-to-end arguments in system design (PDF) as reading material. It is very good, even though it is old.

https://abload.de/img/screenshot2017-10-10anal76.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de: an admin’s cornucopia, python is more than just better bash - Christian Theune

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

A “cornucopia” is a “horn of plenty”. It keeps on giving.

Pragmatism. I can quickly write some python for a task I need to do now. And it will be good enough. You can start right away: you don’t need to start designing an architecture beforehand like you’d have to do in java.

Often if you fix something quickly, you’ll have to fix it a second time a day later. With python, you don’t need to write your code twice. Perhaps 1.5 times. You can add tests, you can fix up the code.

What do they use from python?

  • Language features. Decorators, context managers, fstrings, meta programming.
  • Python’s standard library. You get a lot built-in.
  • Releasing. zc.buildout, pip, virtualenv.
  • Testing. pytest, flake8.
  • Lots of external libraries.

Some of these in detail.

  • Context managers. Safely opening and closing files. They had trouble with some corner cases, so they wrote their own context manager that worked with a temporary file and guaranteed you cannot ever see a half-written file (see the sketch after this list).
  • Decorators. Awesome. For instance for automating lock files. Just a (self-written) @locked on a command line function.
  • asyncio. They use it for starting up various backup jobs at the same time. The overhead you have for getting good async code is very low. yield from is really handy. It is just regular python code, you don’t have to think too hard like with other async solutions.
  • Structlog. An external library for creating structured logs. You can also pass along loggers and add information. This can be handy for seeing what happens during execution.
  • expect/pexpect. You can use it for interacting with command line programs. You call them and “expect” a prompt and interact with it.
  • execnet. “Mobile python”. If you can connect to a remote machine and that machine has python? Then you can send over python modules and pickled data. Basically you’re executing your python code on the other machine.
  • cython. They abuse it for a “setuid” script. In such a script, you cannot use a #!/usr/bin/python line. So they use cython to compile the couple-of-lines python script into an actual binary...
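
A simplified sketch of such an atomic-write context manager (their real one handles more corner cases):

    import os
    import tempfile
    from contextlib import contextmanager

    @contextmanager
    def atomic_write(path):
        """Write to a temporary file and rename it into place on success,
        so readers never see a half-written file."""
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as tmp:
                yield tmp
            os.replace(tmp_path, path)  # atomic rename
        except BaseException:
            os.unlink(tmp_path)
            raise

    with atomic_write("report.csv") as f:
        f.write("a;b;c\n")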

He’s been working with python for 18 years now and he’s constantly surprised by the amount of quality software you can use and that you can rely on for a long time.

https://abload.de/img/screenshot2017-10-05alnu1k.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de: Observing your applications with Sentry and Prometheus - Patrick Mühlbauer

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

Monitoring your applications is important. You can fix problems before they happen. You can quickly pin-point them if they occur anyway. And you should get a good feel for the application through metrics.

There are three ‘pillars of observability’:

  • Logging. Records of individual events that happened.
  • Metrics. Numbers describing a particular process or activity. CPU load, for instance.
  • Tracing. Capture the lifetime of requests as they flow through the various components of a distributed system. (He won’t talk about this).

Error logging

Logging in to a server and searching through logfiles is not much fun. Much better: sentry.

  • It sends notifications for events (mail, slack, etc).
  • It sends them only once. This is important.
  • It aggregates events (statistics, regressions).
  • There are lots of clients for multiple languages (python, javascript, etc.) and platforms (django, flask, angular).
  • It is open source. There is also a software-as-a-service.

He showed the user interface for an error. You see the line in your python code where the error occurred. Traceback. Statistics about browser types, server names, and so on. How often the error occurred already. When it occurred.

It is easy to integrate. He showed it for flask, django and pure python.
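
The pure python variant, with the raven client that was current at the time, is only a couple of lines (the DSN is a placeholder):

    from raven import Client

    client = Client("https://<key>@sentry.example.com/<project-id>")

    try:
        1 / 0
    except ZeroDivisionError:
        client.captureException()  # shows up in sentry with traceback and context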

Metrics

The tool they use is prometheus. A time series database for system and application metrics. Metrics are things like cpu load and memory consumption.

There are different metric types: counters, gauges, histograms, summaries.

Prometheus normally works by pulling data. There are advantages and disadvantages. There is a push gateway, but they don’t use it.

There’s a python client (prometheus_client). You start its webserver so that prometheus can talk to it. In your code, you can configure the things you want to measure (like “request duration”) and then use them as decorators to decorate your actual functions.
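
In code that looks roughly like this (following the prometheus_client basics):

    import random
    import time

    from prometheus_client import Summary, start_http_server

    REQUEST_DURATION = Summary("request_duration_seconds",
                               "Time spent handling a request")

    @REQUEST_DURATION.time()        # measure how long each call takes
    def handle_request():
        time.sleep(random.random())

    if __name__ == "__main__":
        start_http_server(8000)     # prometheus scrapes localhost:8000/metrics
        while True:
            handle_request()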

If you want to know more about prometheus, look at https://hynek.me/talks/prometheus/

https://abload.de/img/screenshot2017-10-04aa4pc6.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de keynote: Artificial intelligence: differentiating hype and real value - Michael Feindt

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

He’s a physics professor and started out with particle physics. The big experiments like in CERN. Big experiments that also generated Big Data. Think terabyte per second. This was long before the term “Big Data” was invented.

Lots of data, you have to filter out the noise and find the actual signal. There’s a fine balance there: if you are too careful, you’ll never discover anything. If you’re too enthusiastic, you can get wrong results. Wrong results are bad for your physics career, so the methods used were quite conservative. He had to fight to get more modern methods like neural networks accepted.

What is intelligence? Two definitions:

  • The ability to achieve complex goals.
  • Ability to acquire and apply knowledge and skills.

And artificial intelligence?

  • All intelligence that is not biological.

Biology, OK, but what is life? “A process that retains its complexity and replicates.” DNA is about 1.2GB. This is the physical life.

Your brain is about 100TB. This is the “software”. Cultural life. It accelerates through teaching, books, technology.

Technological life? That will be when it can design its own hardware and software. He guesses robots/computers will be more intelligent than humans in about 50 years (“the singularity”). He also assumes we’ll improve our own bodies. Improved memory and so on. Whether it will be a good thing, he doesn’t know. But it will happen.

What makes machines intelligent? There are two main branches in AI: brute-force silicon and intelligent algorithms. The first is Moore’s law in combination with deep neural networks. The second is domain knowledge applied to a field in combination with machine learning. After a while, machines start getting better than humans (for instance google image recognition).

Another example is the computer that beats Go champions. No one can beat it anymore. And the current version learned only by playing against itself...

Superhuman narrow AI (= “accomplish complex goals”) is already achieved:

  • Image recognition.
  • Audio translation (text-to-speech).
  • Mental games (chess, jeopardy, go).
  • Atari video games(!). Just by doing it. (“Reinforcement learning”.)
  • Elementary particle physics research (event reconstruction)
  • Retail business management (supply chain, pricing). Pricing is handled by his company ‘Blue Yonder’. AI is better at pricing products than humans!

Decisions. Many important personal and professional decisions are done by gut feeling: they won’t be automated. “Do I want to marry her?”.

Operational decisions can be automated. These are decisions that are repeated often. Placing orders to replenish stock. Setting prices. You have to do it often.

How do we currently make such decisions?

  • Almost always you do nothing. Or you do the same as yesterday or last year. Often, this is bad. There are so many bad decisions being made each day...
  • Follow business rules. This is already better.
  • Really thinking about it is rare.

How should you ideally do it? You need data. And a bit of a prediction of the future. Figure out the cost and the utility. Then optimization. Then automation. “Predictive analysis”.

Predictive analysis isn’t always useful. There’s a spectrum. On one side purely random processes (lottery numbers). On the other side, laws of nature. The middle is where predictive analysis comes in.

Predictive analysis as he does it results in a probability distribution, not in a single predicted number. A probability distribution is much better. But the shop owner needs to know how many to order. You can start with utility and cost functions and optimize based on them.
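
As a toy illustration of that last step: given a predicted demand distribution plus cost and utility numbers, picking the order quantity is just an expected-value optimization (all numbers here are invented):

    # toy newsvendor-style optimization over a predicted demand distribution
    demand_probability = {8: 0.1, 9: 0.2, 10: 0.4, 11: 0.2, 12: 0.1}
    margin_per_sold_item = 2.0     # utility of selling one item
    cost_per_unsold_item = 0.5     # waste cost of an item left on the shelf

    def expected_profit(order):
        return sum(p * (margin_per_sold_item * min(order, demand)
                        - cost_per_unsold_item * max(order - demand, 0))
                   for demand, p in demand_probability.items())

    best_order = max(range(5, 16), key=expected_profit)
    print(best_order, expected_profit(best_order))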

For one supermarket, the out-of-stock rate was 7.5% originally. When they started to follow the AI’s predictions (but with human adjustments), it dropped to 5%. When they decided to do it fully automatic? Below 1%!

But: how nice is it to notice that a computer is better at it than you with your 20 years of supermarket experience???

He showed a couple more nice examples.

He calls it vertical end-to-end AI solutions. Specialized providers combine expertise and experience in solving complicated business problems by narrow AI. More or less an “intelligence layer” on top of existing ERP systems. Make people work smarter.

Back to physics. And to python. He himself originally had to use fortran and punch cards. In high energy physics, they once moved from fortran to c++. This cost a lot of wasted man years. Now many people are moving to python. With numpy, pandas, dask, it is absolutely good enough. It is readable, it is fast, you are productive. He loves it. ‘Dask’ will be very important for the future of python.

Now. AI. Do we want it? He thinks it will be for the good.

https://abload.de/img/screenshot2017-10-04a7eop1.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de: Python on bare metal, micropython on the pyboard - Christine Spindler

2017-10-27

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

(See also yesterday’s talk)

There’s a lot of power and functionality in microcontrollers nowadays. But they are harder and harder to program. Wouldn’t python be a great fit? It allows beginners to do things they couldn’t do before.

Micropython is a powerful and modern language with a large community, especially intended for very constrained/embedded systems.

If you program for embedded systems, you really have to know the hardware. This is different from “regular” programming.

Micropython started with a successful kickstarter. In 2016, the BBC used it for 7 million school children. There was also a kickstarter for porting it to the super cheap ESP8266 chip.

Fun facts:

  • ESA (European space agency) is sponsoring development to make it even more reliable. They’re planning to use it in satellites.
  • It is certified for use in traffic management devices in the UK!

There were some pyboards and people could play with them. Very nice is that you don’t need an IDE: you can just connect to the board and type away at the python prompt.
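
A typical first thing to type at that prompt (pyb is the pyboard’s built-in hardware module):

    # typed directly at the MicroPython REPL on a pyboard
    import pyb

    led = pyb.LED(1)     # the first of the onboard LEDs
    for _ in range(10):
        led.toggle()
        pyb.delay(200)   # milliseconds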

https://abload.de/img/screenshot2017-10-03aybx8p.png

Photo explanation: some 1:87 scale figures on my model railway (under construction).

PyCon.de: empowered by Python - Jens Nie and Peer Wagner

2017-10-26

Tags: pycon, python

(One of my summaries of a talk at the 2017 PyCon.de conference).

Jens and Peer are working with pipeline inspections (for Rosen). (Real-world pipelines of up to 1000km long, not software pipelines). They build their own pipeline inspection robots. There’s a lot of measurements coming out of such an inspection. One measurement every millimeter...

So they’re working with big data. And they’re completely based on python. Everything from matplotlib, numpy, scipy, dask, etc.

Also the laboratory measurements use python now. They were used to matlab, but python was much nicer and easier and more powerful.

In the pipeline industry, they invested lots of money and effort in artificial intelligence. But it just did not work. Lots of overfitting. The time was just not right. A large problem was the lack of enough data. They have that now. And with machine learning, they’re getting results.

They also told about the history of their software development process. It started out as word documents that were then implemented. Next phase: prototypes in matlab with re-implementation in python. Only, the end users discovered the prototypes and started using them anyway...

Now they’re doing everything in python. And prototypes are now more “minimum viable products”. And more and more they’re actually extensions on the base systems.

https://abload.de/img/screenshot2017-10-25ae3jch.png

Photo explanation: picture from our recent cycling holiday (NL+DE). Small stream near Renkum (NL).

 