(One of my summaries of a talk at the 2017 PyCon.de conference).
(Warning beforehand: I hardly know pandas, so my summary might not be totally correct/useful/complete)
When he started using pandas, the differences between a series and a dataframe tended to trip him up.
A series is just like an array. It has a type, as it uses numpy under the hood (“labeled numpy arrays”). It has exactly one type, so a series with ints and floats will be all-floats.
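A minimal sketch of that single-type behaviour (my own example, not from the talk; the variable names and values are made up):

```python
import pandas as pd

# A series with only ints keeps an integer type.
ints = pd.Series([1, 2, 3])

# Mixing ints and floats gives one common type: everything becomes a float.
mixed = pd.Series([1, 2.5, 3])
print(ints.dtype, mixed.dtype)

# The "labeled" part: every value has a label in the index.
labeled = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(labeled["b"])

# The numpy array under the hood.
print(type(mixed.to_numpy()))
```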
Slicing is just series[3:6] or series.iloc[3:6]. He prefers .iloc[] and .loc[] as they are more explicit.
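A small sketch of the two explicit forms, assuming a made-up series with letters as labels. Note that .iloc[] works on positions and leaves out the end, while .loc[] works on labels and includes the end:

```python
import pandas as pd

# Made-up example data with string labels.
series = pd.Series([10, 20, 30, 40, 50, 60, 70], index=list("abcdefg"))

# By position: items 3, 4 and 5 (the end position is excluded).
by_position = series.iloc[3:6]

# By label: "d" up to and including "f".
by_label = series.loc["d":"f"]

print(by_position)
print(by_label)
```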
A dataframe is a bunch of series with an index (that is also a series). If
you slice, you get rows. If you ask for one item, you get a column. It is
better if you use .iloc[].
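To make the rows-versus-column difference concrete, here is a sketch with a tiny made-up dataframe (the "product" column and all values are my own invention):

```python
import pandas as pd

sales_data = pd.DataFrame(
    {"product": ["a", "b", "c", "d"], "units": [10, 45, 60, 30]}
)

# A slice gives you rows (here: the rows at positions 1 and 2).
rows = sales_data[1:3]

# A single item gives you a column, as a series.
column = sales_data["units"]

# The explicit way to ask for the same rows.
same_rows = sales_data.iloc[1:3]

print(rows)
print(column)
```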
A very powerful concept: a boolean index. sales_data['units'] > 40
gives you an index with everything that sold more than 40 items. You can
“and” and “or” those indexes. Handy for filtering.
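A sketch of such filtering, with made-up data (the "country" column is my assumption). In pandas, combining boolean indexes is done with the & and | operators instead of the Python keywords:

```python
import pandas as pd

sales_data = pd.DataFrame(
    {"country": ["NL", "DE", "NL", "DE"], "units": [10, 45, 60, 30]}
)

# Two boolean indexes: one True/False value per row.
many_units = sales_data["units"] > 40
dutch = sales_data["country"] == "NL"

# "and" is written as &, "or" as |.
both = sales_data[many_units & dutch]
either = sales_data[many_units | dutch]

print(both)
print(either)
```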
Multi-index. Handy for data that is hierarchical (country, towns, etc).
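A sketch of a country/town hierarchy, assuming made-up numbers per town (the level names and values are mine):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("NL", "Utrecht"), ("NL", "Nieuwegein"), ("DE", "Wesel"), ("DE", "Karlsruhe")],
    names=["country", "town"],
)
units = pd.Series([1, 2, 3, 4], index=index)  # made-up values

# Everything for one country: the result is indexed by town only.
dutch = units.loc["NL"]

# One specific town within a country.
wesel = units.loc[("DE", "Wesel")]

# Aggregate on one level of the hierarchy.
per_country = units.groupby(level="country").sum()

print(dutch)
print(per_country)
```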
Datetime index. You can use a function to convert timestamps to actual datetimes. Pandas will now treat it correctly, for instance in plots. You can group by years and weeks. Slices work well.
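A sketch of how this could look. I am assuming pd.to_datetime() as the conversion function (the talk summary does not name it), and the column names and values are made up:

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "timestamp": [
            "2016-12-30 08:00",
            "2017-01-02 09:30",
            "2017-01-03 10:00",
            "2017-10-25 14:00",
        ],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Convert the strings to real datetimes and use them as the index.
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
measurements = raw.set_index("timestamp")

# Slicing with date strings: a whole year, or a range of days.
only_2017 = measurements.loc["2017"]
january = measurements.loc["2017-01-01":"2017-01-31"]

# Grouping by year.
per_year = measurements.groupby(measurements.index.year)["value"].sum()

print(only_2017)
print(per_year)
```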
Handy datetime index feature: resampling! You can resample all your various measurements. Like “resample by day and take the maximum”. You’ll then have only one value per day and it will be the maximum of that day. Same for other periods.
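The “resample by day and take the maximum” example could look like this sketch (the temperature values are made up):

```python
import pandas as pd

index = pd.to_datetime(
    [
        "2017-10-25 08:00",
        "2017-10-25 12:00",
        "2017-10-25 18:00",
        "2017-10-26 09:00",
        "2017-10-26 15:00",
    ]
)
temperature = pd.Series([9.5, 14.0, 11.0, 8.0, 12.5], index=index)

# One value per day: the maximum of that day.
daily_max = temperature.resample("D").max()

print(daily_max)
```

Other aggregations such as .mean() or .min() can be used in the same place as .max().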
Categories: if you have values that are strings, they might actually be categories (‘good’, ‘bad’, ‘worse’). These can be stored much more efficiently. And you can use them for further calculations.
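A sketch of the storage difference and of a calculation on categories. The ordering worse < bad < good is my own assumption, as is the repeated made-up data:

```python
import pandas as pd

# Many repeated strings.
ratings = pd.Series(["good", "bad", "worse", "good", "bad"] * 1000)

# The same values as an ordered category (assumed order: worse < bad < good).
rating_type = pd.CategoricalDtype(categories=["worse", "bad", "good"], ordered=True)
as_category = ratings.astype(rating_type)

# Compare the memory needed by both versions.
bytes_as_strings = ratings.memory_usage(deep=True)
bytes_as_category = as_category.memory_usage(deep=True)
print(bytes_as_strings, bytes_as_category)

# Ordered categories can be compared, which is handy for filtering.
at_least_bad = as_category[as_category >= "bad"]
print(len(at_least_bad))
```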
Photo explanation: picture from our recent cycling holiday (NL+DE). Rhein bridge of Wesel.
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.