(One of my summaries of a talk at the 2017 PyCon.de conference).
Matthew Rocklin works on dask for anaconda. He showed a demo.
Python has a mature analytics stack. numpy, pandas, etcetera. These have one big drawback: they are designed to run in RAM on a single machine. Now… how can we parallellize it? And not only numpy and pandas, but also the libraries that build upon it.
What can you do in Python:
Embarrassingly parallel: mulitiprocessing, for instance.
You can use the multiprocessing library output = map(func, data)
becomes output = pool.map(func, data)
. This is the simplest case. Often
it is enough!
Big data collections: spark, SQL, linear algebra. It manages parallelism for you within a fixed algorithm. If you stick to one of those paradigms, you can run on a cluster. Problem solved.
Task schedulers: airflow, celery. You define a graph of python functions with data dependencies between them. The task scheduler then runs those functions on parallel hardware.
How do these solutions hold up
Multiprocessing is included with python and well-known. That’s good. But it isn’t efficient for most of the specific scientific algorithms.
Big data: heavy-weight dependency. You’ll have to get everyone to choose Spark, for instance….
Task schedulers: they’re not optimized for the kinds of computational workloads we need. They’re aimed at ETL. And they’re not very good at exchanging data between tasks, which is often needed in scientific programming.
Enter “dask”.
It is a flexible task scheduler in the core.
Co-developed with pandas/sklearn/jupyter.
It is designed for computational workloads.
It scales from a laptop with lots of threads to multiple machines in a cluster.
It helps you create (and visualize!) a function graph and it helps you schedule and distribute those tasks.
He again demonstrated it on his laptop. Impressive. He added a bunch of extra workers halfway the calculation and they got to work right away.
They’ve build high-level libraries with dask: dask + numpy = dask.array. dask + pandas = dask.dataframe. One dask array is a large multi-dimensional array consisting of multiple numpy arrays on several computers. Same with dash.dataframe for huge timeseries.
Regular map/reduce workloads can handle the simple cases just fine. But once you start working with huge multidimensional arrays with complex operations… it just gets too complicated. That is where dask comes in.
He showed some work-in-progress unstable stuff too. Not ready for production! The api can change! The one that I wrote down was about geopandas, geospatial analysis. It combines pandas’ data operations with the generic geometrical libraries that for instance postgis uses. Only… it is much slower as it only runs on one core. They recently rewrote it in cython, which made it just as quick.
Now dash.geopandas. With this you can partition data geometrically. This gives you quite a performance boost. The ease of geopandas with a great speed.
The work is not finished yet. They implemented some of the hard algorithms (spatial joins, for instance), but left out the easy ones. If you want to contribute: there’s quite some impact you can have here.
Photo explanation: picture from our recent cycling holiday (NL+DE). Camping in an old Dutch fortress.
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):