(Warning beforehand: I hardly know pandas, so my summary might not be totally correct/useful/complete)
When he started using pandas, the differences between Series and DataFrame tended to trip him up often.
Series is just like an array. It has a type, as it uses numpy under the hood (“labeled numpy arrays”). It has one type, so a series with ints and floats will be all-floats.
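(A small sketch of my own, not from the talk, to illustrate the "one type" remark. The variable names and values are made up.)

```python
import pandas as pd

# A series of only ints keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# One float in the mix and the whole series becomes float.
mixed = pd.Series([1, 2.5, 3])

# The "labeled" part: every value has an index label.
labeled = pd.Series([10, 20], index=["a", "b"])

print(ints.dtype)
print(mixed.dtype)
print(labeled["b"])
```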
Slicing is just series.iloc[3:6]. He prefers accessors like .loc as they are more explicit.
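(Again a sketch of my own with made-up data: the same slice taken by position with .iloc and by label with .loc. Note that the .loc slice includes its end label, while the .iloc slice excludes its end position.)

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50, 60, 70], index=list("abcdefg"))

# Position-based: positions 3, 4 and 5 (the end position is excluded).
by_position = series.iloc[3:6]

# Label-based: labels "d" through "f" (the end label is included).
by_label = series.loc["d":"f"]

print(by_position)
print(by_label)
```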
A dataframe is a bunch of series with an index (that is also a series). If
you slice, you get rows. If you ask for one item, you get a column. It is
better if you use
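(My own sketch of the rows-versus-column behaviour, with a made-up dataframe. The .iloc and .loc lines are only there for comparison, mirroring the explicit accessors mentioned for series.)

```python
import pandas as pd

df = pd.DataFrame(
    {"units": [10, 50, 45, 5], "price": [2.5, 1.0, 3.0, 4.0]},
    index=["apples", "pears", "plums", "figs"],
)

# A slice gives you rows (still a dataframe).
rows = df[1:3]

# A single item gives you a column (a series).
column = df["units"]

# The explicit accessors say what you mean.
explicit_rows = df.iloc[1:3]
explicit_cell = df.loc["pears", "units"]

print(rows)
print(column)
```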
A very powerful concept: a boolean index.
sales_data['units'] > 40
gives you a boolean index that marks everything that sold more than 40 units. You can "or" those indexes together. Handy for filtering.
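(A sketch of my own of what such filtering could look like. Only sales_data['units'] comes from the talk; the other columns and the numbers are made up.)

```python
import pandas as pd

sales_data = pd.DataFrame(
    {
        "product": ["a", "b", "c", "d"],
        "units": [10, 50, 45, 5],
        "price": [2.5, 1.0, 3.0, 9.0],
    }
)

# Each comparison gives a series of True/False values, one per row.
many_sold = sales_data["units"] > 40
expensive = sales_data["price"] > 5

# Combine with | ("or") or & ("and"), then use the result as a filter.
filtered = sales_data[many_sold | expensive]

print(filtered)
```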
Multi-index. Handy for data that is hierarchical (country, towns, etc).
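(My own sketch of a country/town multi-index. The numbers are invented, not real population figures.)

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("NL", "Utrecht"), ("NL", "Nieuwegein"), ("DE", "Wesel"), ("DE", "Xanten")],
    names=["country", "town"],
)
# Made-up values; sorting the index keeps lookups straightforward.
population = pd.Series([360, 65, 60, 22], index=index).sort_index()

# Select everything below one country...
dutch = population.loc["NL"]

# ...or one specific (country, town) combination.
wesel = population.loc[("DE", "Wesel")]

# Aggregate per top-level item.
per_country = population.groupby(level="country").sum()

print(dutch)
print(per_country)
```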
Datetime index. You can use a function to convert timestamps to actual datetimes. Pandas will now treat it correctly, for instance in plots. You can group by years and weeks. Slices work well.
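(A sketch of my own. I'm assuming the conversion function meant here is pd.to_datetime; the talk summary doesn't name it. The data is made up.)

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "timestamp": [
            "2017-12-30 08:00",
            "2017-12-31 09:30",
            "2018-01-01 10:00",
            "2018-01-02 11:15",
        ],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Convert the strings to real datetimes and use them as the index.
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
measurements = raw.set_index("timestamp")

# Select a whole year with a partial date string.
only_2018 = measurements.loc["2018"]

# Slice between two dates (both ends are included).
around_new_year = measurements.loc["2017-12-31":"2018-01-01"]

# Group by year.
per_year = measurements.groupby(measurements.index.year)["value"].sum()

print(per_year)
```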
Handy datetime index feature: resampling! You can resample all your various measurements. Like “resample by day and take the maximum”. You’ll then have only one value per day and it will be the maximum of that day. Same for other periods.
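(My own sketch of "resample by day and take the maximum", with made-up temperature readings.)

```python
import pandas as pd

index = pd.to_datetime(
    [
        "2018-05-01 06:00",
        "2018-05-01 14:00",
        "2018-05-02 09:00",
        "2018-05-02 21:00",
    ]
)
temperature = pd.Series([11.0, 19.5, 12.0, 16.0], index=index)

# "D" means one bucket per calendar day; max() picks the highest value in each.
daily_max = temperature.resample("D").max()

print(daily_max)
```

Swap max() for mean() or min(), or "D" for another period, to get the other variants.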
Categories: if you have values that are strings, they might actually be categories (‘good’, ‘bad’, ‘worse’). These can be stored much more efficiently. And you can use them in further calculations.
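(A sketch of my own using the three example values. Treating the categories as ordered, from 'worse' up to 'good', is my assumption; it is what makes the comparison below possible.)

```python
import pandas as pd

ratings = pd.Series(["good", "bad", "worse", "good", "bad"] * 1000)

# Assumption: the categories have a natural order, worse < bad < good.
rating_type = pd.CategoricalDtype(categories=["worse", "bad", "good"], ordered=True)
as_category = ratings.astype(rating_type)

# Categories are stored as small integer codes plus one lookup table.
memory_str = ratings.memory_usage(deep=True)
memory_cat = as_category.memory_usage(deep=True)

# Ordered categories can be compared and counted.
at_least_bad = as_category[as_category >= "bad"]
counts = as_category.value_counts()

print(memory_str, memory_cat)
print(counts)
```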
Photo explanation: picture from our recent cycling holiday (NL+DE). Rhein bridge of Wesel.
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.