Pygrunn: IPython and MongoDB as big data scratchpads - Jens de Smit

(One of the summaries of the 2015 Pygrunn conference)

A show of hands: about half the people in the room have used mongodb and half have used ipython notebooks. There’s not a lot of overlap.

Jens de Smit works for Optiver, a financial company. It is a “high-frequency trader”, so they use a lot of data and they do a lot of calculations. They do a lot of financial transactions and they need to monitor if they made the right trades.

Trading is now almost exclusively done electronically. Waving hands and shouting on the trading floor at a stock exchange is mostly a thing of the past. Match-making between supply and demand is done centrally. It started 15 years ago. The volume of transactions really exploded. Interesting fact: the response time has gone from 300ms to just 1ms!

So… being fast is important in electronic trading. If you’re slow, you trade at the wrong prices. Trading at the wrong prices means losing money. So speed is important, just like making the right choices.

What he had to do is to figure out how fast an order was made and whether it was a good order. Non-intrusively. So: what market event did we react to? What was the automatic trade decision (done by an algorithm)? Was it a good one? How long did it all take?

So he monitors data going in and out of their system. He couldn’t change the base system, so: log files, network data and an accounting database. Most of the data is poorly indexed. And a very low signal-to-noise ratio. And of course the logfiles aren’t all consistent. And documentation is bad.

Oh, and the data size is of course also too big to fit in memory :-)

He used mongodb. A schemaless json (well, bson, binary version of json) store. Great for messy data. Easy to use. Just put in a python dictionary, basically. The data is persisted to disk, but as long as you have enough RAM, it’ll keep it in memory. Very fast that way. You get indexes and speedups by default.
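
A minimal sketch of what “just put in a python dictionary” could look like. The `key=value` log format, the field names and the `scratch`/`events` database and collection names are made up for illustration and do not come from the talk. `insert_many` is pymongo’s bulk insert call; it needs a mongod running locally.

```python
def parse_line(line):
    """Turn one 'key=value key=value' log line into a plain dict.

    The log format is hypothetical; real logs will need messier parsing.
    """
    doc = {}
    for token in line.split():
        if "=" not in token:
            continue  # skip tokens that are not key=value pairs
        key, value = token.split("=", 1)
        try:
            doc[key] = float(value) if "." in value else int(value)
        except ValueError:
            doc[key] = value  # keep anything non-numeric as a string
    return doc


def load_lines(collection, lines):
    """Insert every line that parses to a non-empty dict; return the count."""
    docs = [doc for doc in (parse_line(line) for line in lines) if doc]
    if docs:
        collection.insert_many(docs)
    return len(docs)


def open_scratch_collection():
    """Connect to a local mongod (assumed to listen on the default port)."""
    from pymongo import MongoClient  # third-party: pip install pymongo
    return MongoClient()["scratch"]["events"]
```

With a mongod running, `load_lines(open_scratch_collection(), open("some.log"))` would push a whole (hypothetical) log file into the scratchpad. No schema needs to be declared first, which is what makes this approach pleasant for messy data.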

After he managed to get everything into mongodb, he had to make sense of things. So: correlate decision logs to network data. This is easy for humans to spot, but hard for computers. Computers are good at exact matches, humans are better at inexact pattern matches.
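
The talk summary doesn’t say how the matching was actually done. One common way to make an inexact match tractable for a computer is to pair each decision with the most recent market event that happened shortly before it. The sketch below assumes both kinds of records carry a `ts` timestamp in seconds and uses a made-up tolerance of 5 ms; both are assumptions for illustration.

```python
from bisect import bisect_right


def match_to_events(decisions, events, max_gap=0.005):
    """Pair each decision with the latest event at or before it.

    Both arguments are lists of dicts with a 'ts' key (seconds).
    A decision without an event in the preceding max_gap seconds
    is paired with None.
    """
    events = sorted(events, key=lambda e: e["ts"])
    event_times = [e["ts"] for e in events]
    pairs = []
    for decision in decisions:
        # Index of the first event strictly after the decision.
        idx = bisect_right(event_times, decision["ts"])
        candidate = events[idx - 1] if idx else None
        if candidate is not None and decision["ts"] - candidate["ts"] > max_gap:
            candidate = None  # too long ago to be the trigger
        pairs.append((decision, candidate))
    return pairs
```

The difference between the two timestamps of a matched pair is then the response time for that decision.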

He used ipython notebook, a nice interactive python shell with a browser interface. Including matplotlib integration for easy graphs. Syntax highlighting; you can render html inside the shell; you can save your work at the end of the day (which you can’t with a regular python shell!); inline editing.

Nice: since last week, rendering such notebooks is supported by github. (I guess he means this announcement).

Now mongodb. It is very simple to create a directory and start mongodb. If you stop mongo and delete the directory, it is gone as if it was never there. Easy. And with pymongo it is just a few lines of python code and you’re set. Including a handy query language.
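
The code from the slides isn’t in my notes, so here is a rough sketch of the query language instead. A query is a plain dictionary, with operators like `$gt` for “greater than”. The `latency_ms` and `source` field names are hypothetical.

```python
def slow_responses_query(threshold_ms, source=None):
    """Build a mongodb filter for responses slower than threshold_ms."""
    query = {"latency_ms": {"$gt": threshold_ms}}
    if source is not None:
        query["source"] = source  # exact match on a second field
    return query


def find_slow_responses(collection, threshold_ms, source=None):
    """Return the matching documents from a pymongo collection, slowest first."""
    cursor = collection.find(slow_responses_query(threshold_ms, source))
    return list(cursor.sort("latency_ms", -1))  # -1 means descending
```

With a running mongod and pymongo installed, something like `find_slow_responses(MongoClient()["scratch"]["events"], 1.0)` would return every stored response slower than 1 ms.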

He showed a couple of code examples. Looked pretty handy.

Creating an index is a oneliner. If you know beforehand what kinds of queries you want to do, you can quickly create an index for it, which speeds up your queries a lot. You can make complex indexes, but in his experience, simple single-field indexes are often enough.
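
As an illustration of how small that is: pymongo’s `create_index` accepts a single field name and then builds an ascending index on it. The helper and the field names below are mine, not from the talk.

```python
def ensure_indexes(collection, fields):
    """Create one ascending single-field index per field name.

    Returns whatever create_index returns for each field
    (pymongo returns the name of the index).
    """
    return [collection.create_index(field) for field in fields]
```

For example `ensure_indexes(collection, ["ts", "order_id"])` on a pymongo collection. A compound index would take a list of `(field, direction)` pairs instead of a single name.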

Something to watch out for: mongo never returns disk space to the OS. If you delete lots of objects, the OS doesn’t get the space back unless you shut mongodb down and “repair” the database. What he does is simply delete the database at the end of the day!

He showed one of the outputs: a graph with response times which immediately showed that several responses were too slow. Good, useful information. One year ago he wouldn’t have dreamt of being able to do this sort of analysis.

Mongo is very useful for this kind of work. You use mongodb’s strengths and you aren’t bothered by many of the drawbacks, like missing transactions.

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
