Pandas - What Else Can You .groupby?#

Hey there! Welcome to the first DUTC newsletter of March 2022! We have had an action-packed start to the year and are eager to keep the trainings coming. Next month, we are unveiling a new lineup of weekly seminars titled Confident Queries & Stronger SQL, where we will not only help refine your SQL skills, but also convey the underlying framework and mental models that power the most commonly used database querying language in the world. And if that isn’t enough to get excited about, in my next presentation I’ll be comparing pandas and SQL to address the similarities and differences between these tools. What types of analyses are possible with either tool, how often do they overlap, and when they do overlap, which one should you use? All of these questions and more will be answered this March, so make sure you register now for our SQL seminar series.

Not only do we have SQL sessions upcoming, but we also have a Developing Expertise in Python and pandas course this April 18-21! Our Developing Expertise courses are easily my favorite content we offer. The ability to sit down in a small group and work through problems in a paired-programming environment provides the most impactful form of learning. Not only do you get to ask any question about syntax, concepts, and approaches, but you can do so in a safe environment while learning best practices within the PyData stack. If you want to bridge the gap from intermediate Python programmer to expert Pythonista, then I cannot recommend this course enough. We work tirelessly to create a balanced and custom curriculum to meet the goals of all of our attendees.

Speaking of pandas, I want to share a fantastic practice problem (and fun solution) we encountered while working on an upcoming pandas course.

The question#

Given a DataFrame of daily ingredient costs, daily_ingredient_cost, and a dictionary of recipes, calculate the daily cost of each recipe.

from IPython.display import display

from pandas import DataFrame, period_range
from numpy.random import default_rng

rng = default_rng(0)

dates = period_range('2000-01-01', '2000-12-31', freq='D')
ingredients = [
    'bun', 'beef', 'cheese', 'peanut oil',
    'milk', 'potatoes', 'ice cream'
]
costs = (
    rng.uniform(0.05, 0.2, size=(len(dates), len(ingredients)))
    .round(2)
)

daily_ingredient_cost = DataFrame(
    costs, index=dates, columns=ingredients
)

daily_ingredient_cost.head()
bun beef cheese peanut oil milk potatoes ice cream
2000-01-01 0.15 0.09 0.06 0.05 0.17 0.19 0.14
2000-01-02 0.16 0.13 0.19 0.17 0.05 0.18 0.06
2000-01-03 0.16 0.08 0.18 0.13 0.09 0.11 0.05
2000-01-04 0.07 0.15 0.15 0.14 0.11 0.20 0.20
2000-01-05 0.15 0.15 0.15 0.11 0.07 0.16 0.13
recipes = {
    'hamburger': ['beef', 'bun', 'ketchup', 'cheese'],
    'milkshake': ['milk', 'ice cream'],
    'french fry': ['potatoes', 'peanut oil']
}
inverted_recipe = {vv: k for k, v in recipes.items() for vv in v}

daily_recipe_cost = (
    daily_ingredient_cost
    .groupby(inverted_recipe, axis="columns")
    .sum()
)

display(
    inverted_recipe,
    daily_recipe_cost.head()
)
{'beef': 'hamburger',
 'bun': 'hamburger',
 'ketchup': 'hamburger',
 'cheese': 'hamburger',
 'milk': 'milkshake',
 'ice cream': 'milkshake',
 'potatoes': 'french fry',
 'peanut oil': 'french fry'}
french fry hamburger milkshake
2000-01-01 0.24 0.30 0.31
2000-01-02 0.35 0.48 0.11
2000-01-03 0.24 0.42 0.14
2000-01-04 0.34 0.37 0.31
2000-01-05 0.27 0.45 0.20

What I enjoy about this approach is that it uses groupby in a form that I don’t normally come across when reading others’ code. That is, instead of providing a column to group by, I’ve provided a mapping. Additionally, I’m grouping across the columns instead of across the rows (as is the default). Is this the “right” way to approach this problem? For the problem at hand, I would say yes. However, there are some limitations that stem from how the data is stored. Currently, no two recipes can use overlapping ingredients: from the data alone, we have no way of telling whether a patty was used for a hamburger or for some other recipe, such as a ground beef burrito, unless we have access to some external knowledge.

If you’re unfamiliar with the aforementioned .groupby mechanics, here’s the nuts and bolts of how they work:

.groupby a mapping object#

from pandas import DataFrame, Series

s = Series(range(10), index=[*'abcdefghij'])
mapping = {
    'a': 0, 'b': 0, 'c': 0, 
    'd': 1, 'e': 1, 'f': 1, 
    'g': 2, 'h': 2, 'i': 2, 'j': 2
}

# Using a DataFrame just for display purposes
DataFrame({
    'prev_index': s.index,
    'groupings': s.index.map(mapping)
})
prev_index groupings
0 a 0
1 b 0
2 c 0
3 d 1
4 e 1
5 f 1
6 g 2
7 h 2
8 i 2
9 j 2

.groupby(...) will take a dict or a Series object as input and align it to the calling DataFrame or Series. This alignment can be viewed easily by using either the DataFrame.align or .index.map methods. Once the input is aligned to the calling DataFrame, the groupings are then generated from those aligned values. This enables us to readily transform our groupings without the need to manually create an explicit grouping column or an external grouping Series. Take a look at the ‘prev_index’ and ‘groupings’ columns above to see how this transformation is applied to create groupings from a dictionary.
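To make this concrete, here is a small sketch (reusing the s and mapping objects from above) showing that grouping by the dictionary directly is equivalent to first aligning it ourselves with .index.map:

```python
from pandas import Series

s = Series(range(10), index=[*'abcdefghij'])
mapping = {
    'a': 0, 'b': 0, 'c': 0,
    'd': 1, 'e': 1, 'f': 1,
    'g': 2, 'h': 2, 'i': 2, 'j': 2
}

# Passing the dict directly to .groupby…
by_dict = s.groupby(mapping).sum()

# …is equivalent to aligning it to the index ourselves
by_mapped = s.groupby(s.index.map(mapping)).sum()

print(by_dict)                    # group 0 → 3, group 1 → 12, group 2 → 30
print(by_dict.equals(by_mapped))  # True
```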

.groupby on an axis#

from pandas import DataFrame, Series

df = DataFrame(
    data=[[1,2,3],[4,5,6],[7,8,9],[10,11,12]],
    index=[*'abcd'],
    columns=[*'xyz']
)

df
x y z
a 1 2 3
b 4 5 6
c 7 8 9
d 10 11 12
# group row-wise into 2 groups
df.groupby([0, 0, 1, 1]).sum()
x y z
0 5 7 9
1 17 19 21
# group column-wise into 2 groups
df.groupby([0, 1, 1], axis='columns').sum()
0 1
a 1 5
b 4 11
c 7 17
d 10 23

The axis parameter of groupby is useful for performing grouped operations along the columns instead of the rows. In this case, you can see I’m creating 2 groups and producing their summed results either row-wise or column-wise. This is primarily beneficial if you need to aggregate data that is in a wide format, as it can save you compute time by avoiding the need to reshape the data beforehand. Though if I needed to perform an aggregation across both the columns and the index, I would consider first .stack(...)ing the data so I could perform my aggregations with one call to .groupby(...).agg(...) instead of performing 2 separate .groupby operations.
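As a rough sketch of that last idea (reusing the df from above, with hypothetical row_groups and col_groups mappings of my own invention), .stack()ing lets a single .groupby handle both axes at once:

```python
from pandas import DataFrame

df = DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    index=[*'abcd'],
    columns=[*'xyz'],
)

# hypothetical groupings for each axis
row_groups = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
col_groups = {'x': 0, 'y': 1, 'z': 1}

stacked = df.stack()  # long Series with a (row, column) MultiIndex
result = stacked.groupby([
    stacked.index.get_level_values(0).map(row_groups),
    stacked.index.get_level_values(1).map(col_groups),
]).sum()

print(result.unstack())
#     0   1
# 0   5  16
# 1  17  40
```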

You may have also noticed that I’ve demonstrated another non-label-based input to these groupby operations! You can pass arbitrary lists or Series objects (or really any ordered collection that is not a string) into your call to .groupby(...). It’s important to remember that these .groupby(...) approaches aren’t its most common usage, since we typically want to group on a column that already exists within our data. However, these handy tricks are useful to keep in your toolbelt, since we can often avoid producing an intermediate object and keep our code performant!

There you have it! Another week, another Cameron’s Corner! I hope you enjoyed this edition and learned something you didn’t previously know about .groupby. If you want more intensive Python and pandas content like this, make sure you sign up for our upcoming Developing Expertise in Python and pandas course to take your knowledge to the next level! Talk to you all next week!