pandas groupby Along the Columns#

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss a deprecation in the pandas API. Unfortunately, the axis=… argument in pandas.DataFrame.groupby is being deprecated. While it is official, there has been some disagreement within the community on this newest change, primarily because of the conveniences it offers.

But, what is the axis parameter, and what workarounds do we have? Let’s take a look:

Wide vs Long Data#

All analytical users I know are consumers of data; however, they are not always producers of data. The data we ingest will come from a variety of real-world sources with other individuals’ assumptions baked in. This leads to data ingestion being a nonuniform process—some datasets can be “messy” to our standard. A very common source of data cleaning is the classical “wide” vs “long” data (terminology popularized by the R community), with the idea being that features of a single entity can be stored across multiple columns or as a group of rows.

Take this example from https://libguides.princeton.edu/R-reshape:

# wide format
from pandas import DataFrame

grades_wide = DataFrame({
    'student'   : [*'ABC'],
    'math'      : [99,72,12],
    'literature': [45,78,96],
    'PE'        : [56,55,57]
})

grades_wide
student math literature PE
0 A 99 45 56
1 B 72 78 55
2 C 12 96 57

We can readily transform from wide to long format using DataFrame.melt

# long format

grades_long = (
    grades_wide
    .melt(id_vars=['student'], var_name='subject', value_name='score')
)

grades_long
student subject score
0 A math 99
1 B math 72
2 C math 12
3 A literature 45
4 B literature 78
5 C literature 96
6 A PE 56
7 B PE 55
8 C PE 57

…and back using pivot (with additional index information).

grades_long.pivot(index='student', columns='subject', values='score')
subject PE literature math
student
A 56 45 99
B 55 78 72
C 57 96 12

This is great, but datasets—wide or long—are almost never this clean. Let’s focus on some wide data that need to be cleaned before they are analytically ready:

The Data#

We have student grades stored in a wide format just like before, but now we also have grades for students who retook a class.

from pandas import NA

grades = DataFrame({
    'student'     : [*'ABC'],
    'math_1'      : [99, 72, 12],
    'math_2'      : [NA, NA, 63],
    'literature_1': [45, 78, 96],
    'literature_2': [80, 83, NA],
    'PE'          : [56, 55, 57]
})

grades
student math_1 math_2 literature_1 literature_2 PE
0 A 99 <NA> 45 80 56
1 B 72 <NA> 78 83 55
2 C 12 63 96 <NA> 57

Cleaning: One Grade per Student#

For our final analysis, we want to report the highest score for each student for each class (math, literature, PE). With the current format of our data, we can think that we need to group each of our columns together in the following fashion:

groupings = {
    'math'      : ['math_1', 'math_2'],
    'literature': ['literature_1', 'literature_2'],
    'PE'        : ['PE'],
}

But how do we actually achieve this with our data? This is what .groupby(…, axis=1) would help us solve. First, we need to create our groupings.

groupings_map = {v_: k for k, v in groupings.items() for v_ in v}
groupings_map
{'math_1': 'math',
 'math_2': 'math',
 'literature_1': 'literature',
 'literature_2': 'literature',
 'PE': 'PE'}

groupby(…, axis=1)#

In our first approach, let’s use our deprecated argument from .groupby. This implementation first tucks our non-numeric data (student) into the index and operates along each grouped field to get the maximum in each subject.

# now we can perform our groupby operation!
grades.set_index('student').groupby(groupings_map, axis=1).max()

student

PE

literature

math

A

56

80

99

B

55

83

72

C

57

96

63

Fairly straightforward: groupby because we used groupby(…, axis=1). This operation maps our dictionary onto each column name (effectively renaming the columns on our behalf). We then take those renamed columns group along them, taking the maximum across each column.

While we did this with our groupings_map, this can also be done programmatically:

# group column names whose prefix (before the underscore) match
prefix_groups = grades.columns.drop('student').str.split('_').str[0]
grades.set_index('student').groupby(prefix_groups, axis=1).max()

student

PE

literature

math

A

56

80

99

B

55

83

72

C

57

96

63

transpose → groupby#

Since the groupby(…, axis=1) is deprecated, the documentation now recommends using the following pattern:

  1. transpose your data

  2. groupby the transposition

  3. transpose it back

In this case, we have more explicit transposition steps, which can be computationally intensive (reshaping in pandas is not free). But we can still use this pattern to solve the pattern at hand without emitting a DeprecationWarning.

(
    grades.set_index('student')
    .T                             # 1.
    .groupby(groupings_map).max()  # 2.
    .T                             # 3.
)
PE literature math
student
A 56 80 99
B 55 83 72
C 57 96 63

This produces the same result, just at the price of a few more lines of code and some explicit reshaping. This is the documentation-recommended way to recreate the .groupby(…, axis=1) pattern.

However, I am not satisfied with this since reshaping DataFrames is not a zero-cost operation (especially on larger datasets). I wanted to come up with a way that leverages the Python we all know to produce a result.

More Python#

For this solution, I am going to perform the column groupings in Python (we already have them pre-created in a dictionary, which is incredibly convenient). Then I am going to manually iterate across those column groupings and create each part of my result, finally concatenating them together. Breaking the above into discrete steps, we will…

  1. Iterate over the groupings

  2. Calculate our aggregation and store its result

  3. Concatenate the results

# as a reminder, we have the groups already defined
groupings
{'math': ['math_1', 'math_2'],
 'literature': ['literature_1', 'literature_2'],
 'PE': ['PE']}
from pandas import concat
results = {}
for result_name, columns in groupings.items():
    results[result_name] = grades[columns].max(axis=1) # max along the columns
concat(results, axis=1).set_axis(grades['student'])
math literature PE
student
A 99 80 56
B 72 83 55
C 63 96 57

It’s a little bit more code, but we have completely avoided needing to reshape our data! Having the pre-created groups is mighty convenient, so balancing it out here is a good way to derive the groupings from the columns first. This trick will rely on itertools.groupby to get the column groups just right.

from itertools import groupby

column_groupings = groupby(
    sorted([col for col in grades.columns if col not in ['student']]),
    key=lambda name: name.split('_')[0]
)

for prefix, columns in column_groupings:
    print(f'{prefix:<10} {[*columns]}')
PE         ['PE']
literature ['literature_1', 'literature_2']
math       ['math_1', 'math_2']

Now all we need to do is integrate with our Python approach:

from itertools import groupby
from pandas import concat

column_groupings = groupby(
    sorted([col for col in grades.columns if col not in ['student']]),
    key=lambda name: name.split('_')[0]
)

for result_name, columns in column_groupings:
    results[result_name] = grades[[*columns]].max(axis=1)
concat(results, axis=1).set_axis(grades['student'])
math literature PE
student
A 99 80 56
B 72 83 55
C 63 96 57

Why Python Approaches?#

Before I wrap up today’s blog post, I want to address the reason that Python is worth knowing as a language rather than “something I write to work with a DataFrame.” Taking a look at that last approach, we can easily adapt it to Polars while keeping the column_groupings variable unchanged.

from polars import from_pandas, concat, col, max_horizontal
from itertools import groupby

pl_grades = from_pandas(grades)

column_groupings = groupby(
    sorted([col for col in pl_grades.columns if col not in ['student']]),
    key=lambda name: name.split('_')[0]
)

pl_grades.select(
    'student',
    *(
        max_horizontal(col(columns)).alias(name)
        for name, columns in column_groupings
    )
)
shape: (3, 4)
studentPEliteraturemath
stri64i64i64
"A"568099
"B"558372
"C"579663

To make the above adaptation, all I had to do was change the pandas idioms to Polars idioms (e.g., .max(axis=1) to max_horizontal(columns)). The structure of the code stayed the same, while also being fairly cross-compatible across Python libraries.

Wrap-Up#

And there we go, two ways to continue working with groups of columns without needing to rely on .groupby(…, axis=1) in pandas, and even one adapted to Polars.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Talk to you all next week!