Pandas - What Else Can You .groupby?#
Hey there! Welcome to the first DUTC newsletter of March 2022! We have had an action-packed start to the year and are eager to keep the trainings coming. Next month, in March, we are unveiling a new lineup of weekly seminars titled Confident Queries & Stronger SQL, where we will help you not only refine your SQL skills, but also convey the underlying framework and mental models that power the most commonly used database querying language in the world. And if that isn’t enough to get excited about, then you should be excited for my next presentation, where I’ll be comparing pandas vs SQL to address the similarities and differences between these tools. What types of analyses are possible with either tool, how often do they overlap, and when they do, which one should I use? All of these questions and more will be answered this March! So make sure you register now for our SQL seminar series.
Not only do we have SQL sessions upcoming, but we also have an upcoming Developing Expertise in Python and pandas course this April 18-21! Our Developing Expertise courses are easily my favorite content we offer. The ability to sit down in a small group and address problems in a paired-programming environment provides the most impactful form of learning. Not only do you get to ask any question about syntax, concepts, and approaches, but you can do so in a safe environment while learning best practices within the PyData stack. If you want to bridge the gap from intermediate Python programmer to expert Pythonista, then I cannot recommend this course enough. We work tirelessly to create a balanced and custom curriculum to meet the goals of all of our attendees.
Speaking of pandas, I want to share a fantastic practice problem (and fun solution) we encountered while working on an upcoming pandas course.
The question#
Given a DataFrame of daily ingredient costs, daily_ingredient_cost, and a dictionary of recipes, calculate the daily cost of each recipe.
```python
from IPython.display import display
from pandas import DataFrame, period_range
from numpy.random import default_rng

rng = default_rng(0)
dates = period_range('2000-01-01', '2000-12-31', freq='D')
ingredients = [
    'bun', 'beef', 'cheese', 'peanut oil',
    'milk', 'potatoes', 'ice cream'
]

costs = (
    rng.uniform(0.05, 0.2, size=(len(dates), len(ingredients)))
    .round(2)
)

daily_ingredient_cost = DataFrame(
    costs, index=dates, columns=ingredients
)
daily_ingredient_cost.head()
```
bun | beef | cheese | peanut oil | milk | potatoes | ice cream | |
---|---|---|---|---|---|---|---|
2000-01-01 | 0.15 | 0.09 | 0.06 | 0.05 | 0.17 | 0.19 | 0.14 |
2000-01-02 | 0.16 | 0.13 | 0.19 | 0.17 | 0.05 | 0.18 | 0.06 |
2000-01-03 | 0.16 | 0.08 | 0.18 | 0.13 | 0.09 | 0.11 | 0.05 |
2000-01-04 | 0.07 | 0.15 | 0.15 | 0.14 | 0.11 | 0.20 | 0.20 |
2000-01-05 | 0.15 | 0.15 | 0.15 | 0.11 | 0.07 | 0.16 | 0.13 |
```python
recipes = {
    'hamburger': ['beef', 'bun', 'ketchup', 'cheese'],
    'milkshake': ['milk', 'ice cream'],
    'french fry': ['potatoes', 'peanut oil']
}

inverted_recipe = {vv: k for k, v in recipes.items() for vv in v}

daily_recipe_cost = (
    daily_ingredient_cost
    .groupby(inverted_recipe, axis="columns")
    .sum()
)

display(
    inverted_recipe,
    daily_recipe_cost.head()
)
```

```
/tmp/ipykernel_111227/3214625403.py:9: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.
  daily_ingredient_cost
```

```
{'beef': 'hamburger',
 'bun': 'hamburger',
 'ketchup': 'hamburger',
 'cheese': 'hamburger',
 'milk': 'milkshake',
 'ice cream': 'milkshake',
 'potatoes': 'french fry',
 'peanut oil': 'french fry'}
```

french fry | hamburger | milkshake | |
---|---|---|---|
2000-01-01 | 0.24 | 0.30 | 0.31 |
2000-01-02 | 0.35 | 0.48 | 0.11 |
2000-01-03 | 0.24 | 0.42 | 0.14 |
2000-01-04 | 0.34 | 0.37 | 0.31 |
2000-01-05 | 0.27 | 0.45 | 0.20 |
What I enjoy about this approach is that it uses groupby in a form that I don’t normally come across when reading others’ code. That is, instead of providing a column to group by, I’ve provided a mapping. Additionally, I’m grouping across the columns instead of across the rows (as is the default). Is this the “right” way to approach this problem? For the problem at hand, I would say yes. However, there are some limitations that stem from how the data is stored. Currently, no recipe can use overlapping ingredients: from the data alone, we have no way of telling whether a patty was used for a burger or whether there is another recipe for a ground beef burrito, unless we have access to some other external knowledge.
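As the FutureWarning in the output above notes, grouping with axis="columns" is deprecated in newer pandas releases. A minimal sketch of the non-deprecated equivalent, transposing so the column labels sit on the index, grouping there, and transposing back (using a small made-up subset of the ingredient data):

```python
from pandas import DataFrame, period_range
from numpy.random import default_rng

rng = default_rng(0)
dates = period_range('2000-01-01', '2000-01-05', freq='D')
ingredients = ['bun', 'beef', 'milk']
df = DataFrame(
    rng.uniform(0.05, 0.2, size=(len(dates), len(ingredients))).round(2),
    index=dates, columns=ingredients,
)
mapping = {'bun': 'hamburger', 'beef': 'hamburger', 'milk': 'milkshake'}

# transpose so the ingredient labels become the index, group there,
# then transpose back so the dates are the index again
result = df.T.groupby(mapping).sum().T
print(result)
```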
If you’re unfamiliar with the aforementioned .groupby mechanics, here’s the nuts and bolts of how they work:

.groupby a mapping object#
```python
from pandas import DataFrame, Series

s = Series(range(10), index=[*'abcdefghij'])
mapping = {
    'a': 0, 'b': 0, 'c': 0,
    'd': 1, 'e': 1, 'f': 1,
    'g': 2, 'h': 2, 'i': 2, 'j': 2
}

# Using a DataFrame just for display purposes
DataFrame({
    'prev_index': s.index,
    'groupings': s.index.map(mapping)
})
```
prev_index | groupings | |
---|---|---|
0 | a | 0 |
1 | b | 0 |
2 | c | 0 |
3 | d | 1 |
4 | e | 1 |
5 | f | 1 |
6 | g | 2 |
7 | h | 2 |
8 | i | 2 |
9 | j | 2 |
.groupby(...) will take a dict or a Series object as input and align it to the calling DataFrame or Series. This alignment can be viewed easily by using either the DataFrame.align or .index.map methods. Once the input is aligned to the calling DataFrame, the groupings are then generated from those aligned values. This enables us to readily transform our groupings without the need to manually create an explicit grouping column or an external grouping Series. Take a look at the 'prev_index' and 'groupings' columns to see how the transformation is applied to create groupings from a dictionary.
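To see that alignment drive an actual aggregation, here is a minimal sketch that groups the Series above directly by the same mapping:

```python
from pandas import Series

s = Series(range(10), index=[*'abcdefghij'])
mapping = {
    'a': 0, 'b': 0, 'c': 0,
    'd': 1, 'e': 1, 'f': 1,
    'g': 2, 'h': 2, 'i': 2, 'j': 2
}

# the dict is aligned against s.index, and the mapped values
# become the group labels
totals = s.groupby(mapping).sum()
print(totals)  # group 0 -> 0+1+2, group 1 -> 3+4+5, group 2 -> 6+7+8+9
```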
.groupby on an axis#
```python
from pandas import DataFrame

df = DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    index=[*'abcd'],
    columns=[*'xyz']
)
df
```
x | y | z | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
d | 10 | 11 | 12 |
```python
# group row-wise into 2 groups
df.groupby([0, 0, 1, 1]).sum()
```
x | y | z | |
---|---|---|---|
0 | 5 | 7 | 9 |
1 | 17 | 19 | 21 |
```python
# group column-wise into 2 groups
df.groupby([0, 1, 1], axis='columns').sum()
```

```
/tmp/ipykernel_111227/3316817835.py:2: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.
  df.groupby([0, 1, 1], axis='columns').sum()
```
0 | 1 | |
---|---|---|
a | 1 | 5 |
b | 4 | 11 |
c | 7 | 17 |
d | 10 | 23 |
The axis parameter of groupby is useful for performing grouped operations along the columns instead of the rows. In this case, you can see I’m creating 2 groups and producing their summed results either row-wise or column-wise. This is primarily beneficial if you need to aggregate data that is in a wide format, as it can save you compute time by avoiding the need to reshape the data beforehand. Though if I needed to perform an aggregation along both the columns and the index, I would consider first .stack(...)ing the data to perform my aggregations with one call to .groupby(...).agg(...) instead of performing 2 separate .groupby operations.
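As a rough sketch of that .stack(...) approach, here is one way to aggregate along both the index and the columns in a single .groupby call (the row_groups and col_groups mappings here are made up for illustration):

```python
from pandas import DataFrame

df = DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    index=[*'abcd'],
    columns=[*'xyz'],
)
row_groups = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
col_groups = {'x': 0, 'y': 1, 'z': 1}

# reshape to a long Series whose MultiIndex holds the old row and
# column labels, then group on both mapped levels at once
stacked = df.stack()
result = (
    stacked
    .groupby([
        stacked.index.get_level_values(0).map(row_groups),
        stacked.index.get_level_values(1).map(col_groups),
    ])
    .sum()
    .unstack()
)
print(result)
```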
You may have also noticed that I’ve demonstrated another non-label-based input to these groupby operations! You can pass arbitrary lists or Series objects (or really any ordered collection that is not a string) into your call to .groupby(...). It’s important to remember that these .groupby(...) approaches aren’t its most common usage, since we typically want to group on a column already existing within our data. However, these handy tricks are useful to keep in your toolbelt, since we can often avoid producing an intermediate object to keep our code performant!
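For example, here is a small sketch (with made-up data) contrasting the two: a bare list is matched positionally to the rows, while a Series grouper is aligned by index label, so its order does not need to match:

```python
from pandas import Series

s = Series([10, 20, 30, 40], index=[*'abcd'])

# a bare list is matched positionally to s's rows
by_list = s.groupby(['even', 'odd', 'even', 'odd']).sum()

# a Series grouper is aligned on index labels instead, so the
# reversed index here still maps each label to its own group
grouper = Series(['even', 'odd', 'even', 'odd'], index=[*'dcba'])
by_series = s.groupby(grouper).sum()

print(by_list)    # even: 10+30, odd: 20+40
print(by_series)  # even: 20+40, odd: 10+30
```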
There you have it! Another week, another Cameron’s Corner! I hope you enjoyed this edition and learned something you didn’t previously know about .groupby.
If you want more intensive Python and pandas content like this, make sure you sign up for our upcoming Developing Expertise in Python and pandas course to take your Python and pandas knowledge to the next level! Talk to you all next week!