Data science projects often follow a similar workflow: acquire, ingest/clean, store/manage, wrangle, analyse visually, model, tell the story. For many of those stages, python has nice tools.
Christian Staudt calls it an ecosystem. Well, if you make a diagram showing the various tools, it starts to look like one of those biology diagrams showing which kinds of animals eat which other kinds of animals. Likewise, python libraries each have their function and their specialized niche.
Numpy. The fundamental package for numeric computing in python. N-dimensional arrays. Numpy arrays are different to python lists: they’re laid out in memory in a much more efficient and compact way.
Essential for understanding numpy: “lose your loops”. Don’t loop over arrays with regular python operations, but use numpy methods. That pushes everything down into highly efficient compiled code. That can gain you an order of magnitude in performance.
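To make “lose your loops” a bit more concrete, here is a minimal sketch of my own (not from the talk). The function names and the test data are made up for illustration. Both functions calculate the same sum of squares, but the second one hands the whole array to numpy in one go instead of visiting every element from python.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Plain python loop: every element is handled by the interpreter."""
    total = 0.0
    for value in values:
        total += value * value
    return total


def sum_of_squares_numpy(values):
    """Vectorized version: the multiply-and-add happens in compiled code."""
    return float(np.dot(values, values))


# Made-up test data, purely for illustration.
data = np.arange(100_000, dtype=np.float64)

loop_result = sum_of_squares_loop(data)
numpy_result = sum_of_squares_numpy(data)
```

If you time both versions (for instance with ipython’s %timeit), the numpy one should come out clearly faster. How much faster depends on your machine and on the size of the array.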
Pandas. Labeled, indexed array data structures: series, dataframes, timeseries. It also includes operations tailored to them, like groupby and filter. And handy data import functionality (csv, excel, etc).
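A small sketch of what grouping and filtering look like in practice. This is my own example, not the one from the talk, and the column names and numbers are invented.

```python
import pandas as pd

# Invented example data: visitor counts per city per year.
df = pd.DataFrame(
    {
        "city": ["Utrecht", "Utrecht", "Karlsruhe", "Karlsruhe"],
        "year": [2016, 2017, 2016, 2017],
        "visitors": [10, 20, 30, 50],
    }
)

# groupby: one total per city, returned as a series indexed by city name.
totals = df.groupby("city")["visitors"].sum()

# filter: keep only the rows of those groups whose total exceeds 40.
busy = df.groupby("city").filter(lambda group: group["visitors"].sum() > 40)
```

Here totals has one row per city and busy only keeps the Karlsruhe rows, as Utrecht’s total stays under the threshold.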
He showed a quick example. It pays off to experiment with pandas and to explicitly
Dask. Dask can combine many different pandas dataframes into one. Handy for distributed computing. (Update: Dask can do much more than I originally wrote, like parallel numpy. See Thursday’s keynote about Dask)
Matplotlib. You can create simple plots with just a few lines. With more work, you can get any plot you want out of it.
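“Just a few lines” could look like the sketch below. It is my own illustration and not a plot from the talk. I picked the non-interactive “Agg” backend and an in-memory buffer so that it also runs on a machine without a display. In a notebook you would normally just show the figure.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Assumption: no display available, so render off-screen.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Write the PNG into memory instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```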
Seaborn: production-ready statistical graphics on top of matplotlib. There’s support for pandas and numpy data structures. And the built-in style looks cooler than matplotlib’s own default :-)
For instance, you can plot multi-dimensional data in multiple plots.
Bokeh. Interactive visualizations. Inspired by D3.js.
Holoviews. A more automated way of creating visualizations. It can use bokeh as a backend. You tell it to relate certain dimensions and it’ll give you a (hopefully matching) visualization.
statsmodels and scikit-learn have many models in common. Scikit-learn has a focus on machine learning and it is more pythonic. Statsmodels is focused on hardcore statistical analysis. And it is more approachable for those coming from R.
Networkx is written in pure python. If you want to do graphs at scale, you need something else like igraph or graph-tool.
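For small graphs, networkx is pleasant to work with. A tiny sketch of my own, with made-up node names:

```python
import networkx as nx

# A small undirected graph; the node names are arbitrary.
graph = nx.Graph()
graph.add_edges_from(
    [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")]
)

# Shortest path (by number of edges) between two nodes.
path = nx.shortest_path(graph, "a", "e")

# Number of neighbours per node.
degrees = dict(graph.degree())
```

The path goes from a via d to e, and d is the best-connected node with three neighbours.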
ipython. Powerful interactive python shell. He mentioned an extension for doing parallel calculations. And there’s the “rpy2” extension (formerly known as “rmagic”) that lets you use R at the same time from within your ipython shell.
jupyter notebooks. One of his favourite tools for data science projects. Interactive notebooks where you can combine code, documentation and visualizations together. It starts to look like Donald Knuth’s literate programming.
It is language-agnostic with support for R, julia, scala, etc.
You can export your notebook to PDF.
nbdime: diff and merge tools for jupyter notebooks. With git integration. Essential.
jupyterhub: multi-user server for jupyter notebooks. Nice when you’re working with multiple people. You can use it for instance to share the computing resources of one specific powerful computer.
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.