Utrecht (NL) python meetup september 2018

Tags: python

Data processing using parser combinators - Werner de Groot

He collaborated with data scientists from Wageningen University. The scientists did lots of cool programming stuff, but they did not use version control, so git was introduced :-) They also had lots and lots of data, so Apache Spark was introduced as well.

Their data sets were in huge ascii files, and those files needed to be parsed. He showed an example of a file with DNA data. Ouch, it turns out to be pretty complex because there are quite a few exceptions. Fields (in this example) are separated by semicolons. But some of the values also contain semicolons (quoted, though). So the generic python code they used to parse their DNA data was littered with “if” statements. Unmaintainable.

You have probably heard that hard problems need to be split up into smaller problems. Yes, that’s true. But the smaller parts also need to be combined again. So: smaller parsers + a way of combining them again.

A parser takes an input and returns the part that matched and the part that remained.
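To make that definition concrete, here is a minimal sketch in plain python. It is not parsy's API and not the code from the talk: the function name `number` and the "return None on failure" convention are assumptions made for illustration.

```python
import re


def number(text):
    """Try to match leading digits.

    Returns (matched, remainder), or None when the input does not
    start with a digit. Name and failure convention are hypothetical.
    """
    match = re.match(r"[0-9]+", text)
    if match is None:
        return None  # this parser does not apply to the input
    return match.group(0), text[match.end():]


print(number("42;rest"))
print(number("abc"))
```

The first call splits the input into the matched part `'42'` and the remainder `';rest'`; the second call fails and returns `None`.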

He showed parsy as a library for “parser combinators”. Many languages have such libraries. He demonstrated how to combine those input/match/remainder parsers into a kind of pipeline/sequence. Such a sequence of parsers can be treated as a parser in its own right. This makes designing and nesting them easy.
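The idea that a sequence of parsers is itself a parser can be sketched in a few lines of plain python. This is a simplified illustration, not parsy's implementation: `literal` and `sequence` are hypothetical helper names, and a parser here is just a function returning `(matched, remainder)` or `None`.

```python
def literal(expected):
    """Build a parser that matches exactly the string `expected`."""
    def parse(text):
        if text.startswith(expected):
            return expected, text[len(expected):]
        return None
    return parse


def sequence(*parsers):
    """Combine parsers: run each one on the remainder left by the previous."""
    def parse(text):
        matched = []
        for parser in parsers:
            result = parser(text)
            if result is None:
                return None  # one failing step fails the whole sequence
            value, text = result
            matched.append(value)
        return matched, text
    return parse


# A sequence is a parser again, so it can be nested in another sequence.
key_value = sequence(literal("id"), literal("="), literal("7"))
pair = sequence(key_value, literal(";"))

print(key_value("id=7;x"))
print(pair("id=7;x"))
```

Because `sequence` returns a function with the same shape as the parsers it receives, nesting needs no special support.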

When combining parsers, you of course need to handle variants: an “or” operator handles that.
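An “or” combinator can be sketched the same way: try each alternative in turn and keep the first one that matches. Again this is plain python under assumed conventions (a parser returns `(matched, remainder)` or `None`); the names `literal`, `either` and `strand` are made up for this example.

```python
def literal(expected):
    """Build a parser that matches exactly the string `expected`."""
    def parse(text):
        if text.startswith(expected):
            return expected, text[len(expected):]
        return None
    return parse


def either(*parsers):
    """Combine parsers: return the result of the first one that matches."""
    def parse(text):
        for parser in parsers:
            result = parser(text)
            if result is not None:
                return result
        return None  # none of the variants matched
    return parse


# Hypothetical domain example: a strand marker is either "+" or "-".
strand = either(literal("+"), literal("-"))

print(strand("+ATG"))
print(strand("-ATG"))
print(strand("ATG"))
```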

Someone asked about “yacc” parsers, which are able to robustly handle each and every corner case: “how does it compare to the simpler ‘parsy’?” The answer: parsy is a simple library, and there are more elaborate python libraries. But: parsy is already quite good. A json parser written in parsy takes only 50 lines!

He did a live demo, developing a set of combined parsers step by step. Fun! And very understandable. So “parsy” sounds like a nice library for this kind of work.

There were some questions comparing this to regular expressions. Werner’s answer was that parsy’s kind of parsers are much more readable and debuggable. He was surprised at the number of attendees that like regular expressions :-)

The nice thing: every individual part of your parser (“just some numbers”, “an equals sign”) is a piece of python, so you can give it a name. This way, you can give those pieces of parser names from your domain (like dnaName, type, customer).
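As an illustration of naming the pieces, here is a small plain-python sketch for a semicolon-separated line in which quoted values may themselves contain semicolons. It is not the parser from the talk and does not use parsy: the helpers (`regex`, `either`, `separated_by`), the piece names and the sample line are all assumptions.

```python
import re


def regex(pattern):
    """Build a parser that matches `pattern` at the start of the input."""
    compiled = re.compile(pattern)
    def parse(text):
        match = compiled.match(text)
        if match is None:
            return None
        return match.group(0), text[match.end():]
    return parse


def either(*parsers):
    """Return the result of the first parser that matches."""
    def parse(text):
        for parser in parsers:
            result = parser(text)
            if result is not None:
                return result
        return None
    return parse


def separated_by(item, separator):
    """Parse one or more `item`s with `separator` between them."""
    def parse(text):
        result = item(text)
        if result is None:
            return None
        value, text = result
        values = [value]
        while True:
            sep = separator(text)
            if sep is None:
                break
            following = item(sep[1])
            if following is None:
                break
            value, text = following
            values.append(value)
        return values, text
    return parse


# Every piece is an ordinary python value, so it can carry a readable name.
quoted_value = regex(r'"[^"]*"')   # may contain semicolons; quotes are kept
plain_value = regex(r'[^;"\n]*')   # everything up to the next semicolon
field = either(quoted_value, plain_value)
semicolon = regex(r";")
record = separated_by(field, semicolon)

print(record('gene1;"kinase; putative";42'))
```

The quoted field comes back as one value, semicolon included, without a single “if” statement in the record definition itself.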

In the end, he live-coded the whole DNA ascii file parser. Quite boring. And that was his whole point: what would be hard or impossible to do in plain python becomes “just boring” with parsy. Exactly what we want!

A practical application of Python metaclasses - Jan-Hein Bührman

(See an earlier summary about metaclasses being used in django)

Apart from metaclasses, he showed some utilities that he likes: pipenv, pylint, mypy.

A nice touch to his presentation: he had his example code all in separate branches. Instead of live coding, he just switched branches all the time. Because he gave his branches clear names, it worked quite well!

The example he built up is impossible to summarize here. The example included a register function that he had to call on certain classes. He didn’t like it. That’s where metaclasses come in.

Python objects are instances of classes. Classes themselves are instances of type. You can create classes programmatically by doing something like:

>>> B = type('B', (), {})
>>> b = B()
>>> type(b)
<class '__main__.B'>

Normally, when python imports a module (= reads a python file), class statements are executed and the class is created. You can influence that process with a metaclass: a subclass of type with its own __new__ method.

  • A class's __init__() influences the creation of objects from the class (= instantiating the object from the class).
  • A metaclass's __new__() influences the creation of the class (= instantiating the class from type).

He used it to automatically register the classes that use the metaclass, so the explicit register call is no longer needed.
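A minimal sketch of registration through a metaclass could look like the following. It is not the speaker's code: the `registry` dict, the `RegisteringMeta` name and the plugin classes are hypothetical, and the sketch assumes that the classes themselves are what gets registered.

```python
registry = {}  # hypothetical registry: class name -> class


class RegisteringMeta(type):
    """Metaclass that records every subclass it creates."""

    def __new__(mcs, name, bases, namespace):
        # This runs when a class statement is executed, not at instantiation.
        cls = super().__new__(mcs, name, bases, namespace)
        if bases:  # skip the base class itself, which has no parents
            registry[name] = cls
        return cls


class Plugin(metaclass=RegisteringMeta):
    pass


class CsvPlugin(Plugin):
    pass


class JsonPlugin(Plugin):
    pass


print(sorted(registry))
```

Defining a subclass is enough to get it registered; no separate register call is needed.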

Note: python 3.6 added __init_subclass__(), which makes this much easier.
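For comparison, a sketch of the same kind of registration with __init_subclass__() and no metaclass at all. The class names and the `registry` attribute are again made up for illustration.

```python
class Plugin:
    registry = {}  # hypothetical registry: class name -> class

    def __init_subclass__(cls, **kwargs):
        # Called automatically each time a subclass of Plugin is defined.
        super().__init_subclass__(**kwargs)
        Plugin.registry[cls.__name__] = cls


class CsvPlugin(Plugin):
    pass


class JsonPlugin(Plugin):
    pass


print(sorted(Plugin.registry))
```

The hook is not called for `Plugin` itself, only for its subclasses, so no extra check is needed to skip the base class.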


About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
