pandas groupby Along the Columns#
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss a deprecation in the pandas API. Unfortunately, the axis=… argument in pandas.DataFrame.groupby is being deprecated. While the decision is official, there has been some disagreement within the community about this change, primarily because of the conveniences the argument offers.
But what is the axis parameter, and what workarounds do we have? Let’s take a look:
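Before diving in, here is a quick refresher on what axis means in pandas (a minimal sketch of my own, not taken from the post or the pandas docs): axis=0 points down the rows (the index), while axis=1 points across the columns, so a reduction with axis=1 collapses each row across its columns.
# a tiny illustration of the axis argument on a reduction
from pandas import DataFrame

df = DataFrame({'a': [1, 2], 'b': [10, 20]})

print(df.sum(axis=0))  # one total per column: a → 3, b → 30
print(df.sum(axis=1))  # one total per row:    0 → 11, 1 → 22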
Wide vs Long Data#
All analytical users I know are consumers of data; however, they are not always producers of data. The data we ingest comes from a variety of real-world sources with other individuals’ assumptions baked in, which makes data ingestion a nonuniform process: some datasets arrive “messy” by our standards. A very common source of data-cleaning work is the classic “wide” vs “long” data distinction (terminology popularized by the R community), the idea being that the features of a single entity can be stored either across multiple columns or as a group of rows.
Take this example from https://libguides.princeton.edu/R-reshape:
# wide format
from pandas import DataFrame
grades_wide = DataFrame({
    'student'   : [*'ABC'],
    'math'      : [99, 72, 12],
    'literature': [45, 78, 96],
    'PE'        : [56, 55, 57]
})
grades_wide
|   | student | math | literature | PE |
|---|---------|------|------------|----|
| 0 | A       | 99   | 45         | 56 |
| 1 | B       | 72   | 78         | 55 |
| 2 | C       | 12   | 96         | 57 |
We can readily transform from wide to long format using DataFrame.melt…
# long format
grades_long = (
    grades_wide
    .melt(id_vars=['student'], var_name='subject', value_name='score')
)
grades_long
|   | student | subject    | score |
|---|---------|------------|-------|
| 0 | A       | math       | 99    |
| 1 | B       | math       | 72    |
| 2 | C       | math       | 12    |
| 3 | A       | literature | 45    |
| 4 | B       | literature | 78    |
| 5 | C       | literature | 96    |
| 6 | A       | PE         | 56    |
| 7 | B       | PE         | 55    |
| 8 | C       | PE         | 57    |
…and back using pivot (with additional index information).
grades_long.pivot(index='student', columns='subject', values='score')
| subject | PE | literature | math |
|---------|----|------------|------|
| student |    |            |      |
| A       | 56 | 45         | 99   |
| B       | 55 | 78         | 72   |
| C       | 57 | 96         | 12   |
This is great, but datasets—wide or long—are almost never this clean. Let’s focus on some wide data that need to be cleaned before they are analytically ready:
The Data#
We have student grades stored in a wide format just like before, but now we also have grades for students who retook a class.
from pandas import NA
grades = DataFrame({
    'student'     : [*'ABC'],
    'math_1'      : [99, 72, 12],
    'math_2'      : [NA, NA, 63],
    'literature_1': [45, 78, 96],
    'literature_2': [80, 83, NA],
    'PE'          : [56, 55, 57]
})
grades
|   | student | math_1 | math_2 | literature_1 | literature_2 | PE |
|---|---------|--------|--------|--------------|--------------|----|
| 0 | A       | 99     | <NA>   | 45           | 80           | 56 |
| 1 | B       | 72     | <NA>   | 78           | 83           | 55 |
| 2 | C       | 12     | 63     | 96           | <NA>         | 57 |
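A quick aside (an optional step I am adding here, not part of the original walkthrough): because math_2 and literature_2 mix plain Python ints with pandas.NA, those columns typically come back as object dtype; convert_dtypes() can normalize them to pandas’ nullable integer types if you prefer tidier dtypes. The rest of the post works either way.
# optional: inspect and normalize the mixed int/NA columns
print(grades.dtypes)           # math_2 / literature_2 likely show as object here
grades_nullable = grades.convert_dtypes()
print(grades_nullable.dtypes)  # nullable Int64 columns with proper <NA> support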
Cleaning: One Grade per Student#
For our final analysis, we want to report the highest score each student earned in each class (math, literature, PE). Given the current format of our data, we need to group our columns together in the following fashion:
groupings = {
    'math'      : ['math_1', 'math_2'],
    'literature': ['literature_1', 'literature_2'],
    'PE'        : ['PE'],
}
But how do we actually achieve this with our data? This is exactly what .groupby(…, axis=1) would help us solve. First, we need to invert our groupings into a mapping from each column to the group it belongs to:
groupings_map = {v_: k for k, v in groupings.items() for v_ in v}
groupings_map
{'math_1': 'math',
'math_2': 'math',
'literature_1': 'literature',
'literature_2': 'literature',
'PE': 'PE'}
groupby(…, axis=1)#
In our first approach, let’s use the deprecated argument to .groupby. This implementation first tucks our non-numeric data (student) into the index, then operates along each group of columns to take the maximum score in each subject.
# now we can perform our groupby operation!
grades.set_index('student').groupby(groupings_map, axis=1).max()
| student | PE | literature | math |
|---------|----|------------|------|
| A       | 56 | 80         | 99   |
| B       | 55 | 83         | 72   |
| C       | 57 | 96         | 63   |
Fairly straightforward: the groupby operates along the columns because we passed axis=1. This operation maps our dictionary onto each column name (effectively renaming the columns on our behalf). We then group along those renamed columns, taking the maximum within each group for every row.
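To make that “effectively renaming” idea concrete, here is a small sketch of my own (not required for the final result) that applies the same mapping with .rename and then groups the now-duplicated column labels via the transpose round trip covered later in this post:
# renaming with groupings_map shows the duplicated labels that
# groupby(…, axis=1) was grouping on our behalf
renamed = grades.set_index('student').rename(columns=groupings_map)
print(renamed.columns.tolist())   # ['math', 'math', 'literature', 'literature', 'PE']

# grouping those duplicated labels reproduces the same result without axis=1
print(renamed.T.groupby(level=0).max().T)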
While we did this with our groupings_map
, this can also be done programmatically:
# group column names whose prefix (before the underscore) match
prefix_groups = grades.columns.drop('student').str.split('_').str[0]
grades.set_index('student').groupby(prefix_groups, axis=1).max()
| student | PE | literature | math |
|---------|----|------------|------|
| A       | 56 | 80         | 99   |
| B       | 55 | 83         | 72   |
| C       | 57 | 96         | 63   |
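If you want to double-check what the programmatic version built, a quick inspection (my own sanity check, not part of the original flow) pairs each non-student column with the prefix it will be grouped under, matching the groupings_map we wrote by hand:
# pair each non-student column with its derived prefix
non_student = grades.columns.drop('student')
print(dict(zip(non_student, prefix_groups)))
# {'math_1': 'math', 'math_2': 'math', 'literature_1': 'literature',
#  'literature_2': 'literature', 'PE': 'PE'}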
transpose → groupby#
Since groupby(…, axis=1) is deprecated, the documentation now recommends the following pattern:
1. transpose your data
2. groupby the transposition
3. transpose it back
In this case, we have more explicit transposition steps, which can be computationally intensive (reshaping in pandas is not free). But we can still use this pattern to solve the problem at hand without triggering the deprecation warning.
(
    grades.set_index('student')
    .T                              # 1.
    .groupby(groupings_map).max()   # 2.
    .T                              # 3.
)
| student | PE | literature | math |
|---------|----|------------|------|
| A       | 56 | 80         | 99   |
| B       | 55 | 83         | 72   |
| C       | 57 | 96         | 63   |
This produces the same result, just at the price of a few more lines of code
and some explicit reshaping. This is the documentation-recommended way to recreate
the .groupby(…, axis=1)
pattern.
However, I am not satisfied with this since reshaping DataFrames is not a zero-cost operation (especially on larger datasets). I wanted to come up with a way that leverages the Python we all know to produce the same result.
More Python#
For this solution, I am going to perform the column groupings in Python (we already have them pre-created in a dictionary, which is incredibly convenient). Then I am going to manually iterate across those column groupings, create each part of my result, and finally concatenate the pieces together. Breaking the above into discrete steps, we will…
1. Iterate over the groupings
2. Calculate our aggregation and store its result
3. Concatenate the results
# as a reminder, we have the groups already defined
groupings
{'math': ['math_1', 'math_2'],
'literature': ['literature_1', 'literature_2'],
'PE': ['PE']}
from pandas import concat

results = {}
for result_name, columns in groupings.items():
    results[result_name] = grades[columns].max(axis=1)  # max along the columns

concat(results, axis=1).set_axis(grades['student'])
| student | math | literature | PE |
|---------|------|------------|----|
| A       | 99   | 80         | 56 |
| B       | 72   | 83         | 55 |
| C       | 63   | 96         | 57 |
It’s a little bit more code, but we have completely avoided needing to reshape our data! That said, having the pre-created groups was mighty convenient, so to balance things out, let’s also derive the groupings from the column names themselves. This trick relies on itertools.groupby to get the column groups just right (note that itertools.groupby only groups consecutive items, which is why we sort the column names first).
from itertools import groupby

column_groupings = groupby(
    sorted([col for col in grades.columns if col not in ['student']]),
    key=lambda name: name.split('_')[0]
)

for prefix, columns in column_groupings:
    print(f'{prefix:<10} {[*columns]}')
PE         ['PE']
literature ['literature_1', 'literature_2']
math       ['math_1', 'math_2']
Now all we need to do is integrate with our Python approach:
from itertools import groupby
from pandas import concat

column_groupings = groupby(
    sorted([col for col in grades.columns if col not in ['student']]),
    key=lambda name: name.split('_')[0]
)

# `results` still holds the entries from the loop above; these assignments
# overwrite them in place (which is why the column order below is unchanged)
for result_name, columns in column_groupings:
    results[result_name] = grades[[*columns]].max(axis=1)

concat(results, axis=1).set_axis(grades['student'])
| student | math | literature | PE |
|---------|------|------------|----|
| A       | 99   | 80         | 56 |
| B       | 72   | 83         | 55 |
| C       | 63   | 96         | 57 |
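If you want to check the reshaping-cost claim for yourself, here is a rough timing sketch (my own, using a made-up larger frame; the column names mirror this post, but the size is arbitrary and any resulting numbers are purely illustrative and will vary by machine and pandas version):
# a rough benchmark sketch: transpose round trip vs. per-group column max
from timeit import timeit
from numpy.random import default_rng
from pandas import DataFrame, concat

rng = default_rng(0)
big = DataFrame(
    rng.integers(0, 100, size=(100_000, 5)),
    columns=['math_1', 'math_2', 'literature_1', 'literature_2', 'PE'],
)

groupings = {
    'math'      : ['math_1', 'math_2'],
    'literature': ['literature_1', 'literature_2'],
    'PE'        : ['PE'],
}
groupings_map = {c: g for g, cols in groupings.items() for c in cols}

def via_transpose():
    return big.T.groupby(groupings_map).max().T

def via_loop():
    return concat(
        {name: big[cols].max(axis=1) for name, cols in groupings.items()},
        axis=1,
    )

print('transpose', timeit(via_transpose, number=10))
print('loop     ', timeit(via_loop, number=10))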
Why Python Approaches?#
Before I wrap up today’s blog post, I want to address why Python is worth knowing as a language in its own right, rather than just as “something I write to work with a DataFrame.” Taking a look at that last approach, we can easily adapt it to Polars while keeping the column_groupings logic unchanged.
from itertools import groupby
from polars import from_pandas, col, max_horizontal

pl_grades = from_pandas(grades)

column_groupings = groupby(
    sorted([c for c in pl_grades.columns if c not in ['student']]),
    key=lambda name: name.split('_')[0]
)

pl_grades.select(
    'student',
    *(
        max_horizontal(col(columns)).alias(name)
        for name, columns in column_groupings
    )
)
| student | PE  | literature | math |
|---------|-----|------------|------|
| str     | i64 | i64        | i64  |
| "A"     | 56  | 80         | 99   |
| "B"     | 55  | 83         | 72   |
| "C"     | 57  | 96         | 63   |
To make the above adaptation, all I had to do was change the pandas idioms to Polars idioms (e.g., .max(axis=1) became max_horizontal(col(columns))). The structure of the code stayed the same, and the surrounding Python remained fairly portable across DataFrame libraries.
Wrap-Up#
And there we go, two ways to continue working with groups of columns without
needing to rely on .groupby(…, axis=1)
in pandas, and even one adapted to Polars.
What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.
Talk to you all next week!