Group By Patterns in pandas

Welcome back to Cameron's Corner! This week, I wanted to write about a topic that came up in a question I recently came across on Stack Overflow. In it, the user wanted to group by a name column, aggregate a few fields, and choose one row per group based on the length of a string column.

While it seems simple on the surface, this example is a classic case where pandas’ groupby capabilities shine, but also where it’s easy to reach for .apply() and user-defined functions (UDFs) without considering the performance costs. In this post, we’ll walk through the original problem and demonstrate a pattern for refactoring groupby + apply into something more efficient and scalable, which is especially useful when you’re working with larger datasets. Along the way, we’ll explore where and why performance matters and show how to restructure computations so they align better with pandas’ built-in methods.

The Question

from pandas import DataFrame

pd_df = DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6]
})

pd_df
Name Source Description Value
0 John A Text1 1
1 John B Longer text 4
2 Bob B Text2 2
3 Alice Z Longer text 5
4 Alice Y The Longest text 3
5 Alice X Text3 6

Our goal is to group by the Name column and...

  1. Concatenate the Source values in sorted order.

  2. For each group, pick the Description and Value from the row with the longest description.

We should expect a result like:

DataFrame({
    'Name': ['Alice', 'Bob', 'John'],
    'Source': ['X, Y, Z', 'B', 'A, B'],
    'Description': ['The Longest text', 'Text2', 'Longer text'],
    'Value': [3, 2, 4],
})
Name Source Description Value
0 Alice X, Y, Z The Longest text 3
1 Bob B Text2 2
2 John A, B Longer text 4

Within each grouping of Name,

  1. sort & join the "Source" column

  2. take the "Value" and "Description" from the row with the longest individual description

Naive Solutions

Let’s first write this in the most straightforward way, using .groupby.apply().

def custom_agg(group):
    longest_desc_df = (
        group.assign(_strlen=group['Description'].str.len())
        .nlargest(1, '_strlen')
    )

    return DataFrame({
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': longest_desc_df['Description'],
        'Value': longest_desc_df['Value']
    })

pd_df.groupby('Name').apply(custom_agg, include_groups=False).droplevel(1)
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

This works and is intuitive, but there’s a cost: groupby.apply() with UDFs often struggles to scale. Let’s examine what this means at scale.

Scaling Groupby Operations

Here, I'll simulate a larger dataset and compare four common ways of aggregating. Below, I set up some code for microbenchmarking as well as the dataset we'll use. We should have ~10,000 unique groups spread across ~1,000,000 rows, which gives us a moderate amount of cardinality and data to process.

from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(msg):
    start = perf_counter()
    yield
    stop = perf_counter()
    print(f'{msg:<25}{stop - start:.6f}s elapsed')

from numpy import unique
from numpy.random import default_rng
from pandas import DataFrame, Series
from string import ascii_uppercase

rng = default_rng(0)
categories = unique(
    rng.choice([*ascii_uppercase], size=(10_000, length := 4), replace=True)
    .view(f'<U{length}')
)

df = DataFrame({
    'groups' : categories.repeat(reps := 100),
    'values' : rng.normal(100, 10, size=(reps * categories.size)),
}).sample(frac=1, replace=True).reset_index(drop=True)

print(f'{df["groups"].nunique()} unique categories across {df.shape[0]} rows')
df.head()
9874 unique categories across 987400 rows
groups values
0 AWZN 85.156486
1 LMAL 109.013665
2 HKKC 81.790056
3 NXMY 116.955168
4 XLWE 107.381923

Let's compare the simplest operation I can think of: grouped averages. This is easy to implement, but you can perform this operation in many different ways. I'll compare some of the most common ways of writing this computation that I have seen:

  • .groupby.mean()

  • .groupby.agg("mean")

  • .groupby.apply(lambda x: x.mean())

  • Manual for-loop & concatenation

grouped = df.groupby('groups')
grouped.groups # build the group mapping up front so it isn't included in the timings below

with timed('① groupby.mean()'):
    res1 = grouped['values'].mean()

with timed('② groupby.agg("mean")'):
    res2 = grouped['values'].agg("mean")

with timed('③ groupby.apply(lambda …)'):
    res3 = grouped['values'].apply(lambda x: x.mean())

with timed('④ groupby for-loop'):
    parts = {}
    for name, group in grouped['values']:
        parts[name] = group.mean()
    res4 = Series(parts)

# verify all results produce the same values
from itertools import pairwise
from numpy import allclose
assert all(allclose(x, y) for x, y in pairwise([res1, res2, res3, res4]))
① groupby.mean()         0.005905s elapsed
② groupby.agg("mean")    0.005266s elapsed
③ groupby.apply(lambda …)0.401816s elapsed
④ groupby for-loop       0.391656s elapsed

As you can see, approaches 1 & 2 perform similarly, while approaches 3 & 4 introduce a massive (50x-100x) slowdown; note that .groupby(…).apply(UDF) is no faster than a plain Python for-loop.
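
Why is the gap so large? Conceptually, .groupby(…).apply(UDF) boils down to a Python-level loop that calls your function once per group and then glues the per-group results back together, whereas .mean() and .agg("mean") dispatch to optimized, vectorized code paths. As a rough mental model (a simplified sketch, not pandas' actual implementation; sketch_groupby_apply is a made-up name, and it assumes the UDF returns a pandas object):

from pandas import concat

def sketch_groupby_apply(df, by, udf):
    # one Python-level call of the UDF per group…
    pieces = {}
    for key, group in df.groupby(by):
        pieces[key] = udf(group)
    # …then the per-group results get stitched back together;
    # the group keys become the outer index level
    return concat(pieces)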

Clearly, there is some large benefit to using built-in pandas aggregations in terms of performance. But the answer to the original question from Stack Overflow wasn’t a simple or obvious .mean() operation. It required many different operations strung together. So how do we refactor our existing approach? Let's talk about it.

Refactoring pandas groupby

First of all, if you are working with a small dataset that has a small number of unique groups (e.g., hundreds of groups and thousands of rows), then refactoring isn't necessary. The refactoring approach discussed below favors optimizations that are most beneficial when working with larger datasets. Of course, we're still well below the threshold of "big" data, which is the point at which one would abandon pandas altogether. As a reminder, here is the naive solution we'll be refactoring:

def custom_agg(group):
    longest_desc_df = (
        group.assign(_strlen=group['Description'].str.len())
        .nlargest(1, '_strlen')
    )

    return DataFrame({
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': longest_desc_df['Description'],
        'Value': longest_desc_df['Value']
    })

pd_df.groupby('Name').apply(custom_agg, include_groups=False).droplevel(1)
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

Here's the trick:

  • Avoid user-defined functions (UDFs; like our custom_agg function) wherever possible

  • Use the more specific .agg or .transform methods over .apply (a short illustration of the difference follows this list)

    • Reach for .apply only if you need to access multiple columns within the grouped computation, noting that this has a negative impact on overall performance.
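
To make the .agg/.transform distinction concrete, here is a minimal illustration using the pd_df from above: .agg collapses each group down to one row, while .transform returns a result aligned to the original rows.

pd_df.groupby('Name')['Value'].agg('max')        # one value per Name (3 rows)
pd_df.groupby('Name')['Value'].transform('max')  # each group's max, broadcast back to all 6 rows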

So how do you remove the reliance on user-defined functions? There are two types of computations to keep an eye out for when refactoring. You'll want to refactor any computation that…

  1. Is applied equally across all groups (e.g., computing the length of the strings in the "Description" column)

  2. Needs to access multiple columns within the same group (e.g., locating the "Value" that corresponds to the longest "Description")

Let's review our solution to see if we can spot any areas that can be refactored.

def custom_agg(group):
    longest_desc_loc = (
        group['Description'].str.len() # applied equally regardless of the group
        .idxmax()                      # Series.groupby(…).idxmax(…)
    )

    return DataFrame({ 
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': group.loc[longest_desc_loc, 'Description'],
        'Value': group.loc[longest_desc_loc, 'Value'],
    }, index=[0])

# we’re ultimately aggregating here, so use the most appropriate verb: `.agg`
# By using this verb we will also avoid accessing multiple columns within the
#   same grouped computation (`custom_agg` uses "Description", "Source", and "Value").
def naive_pandas(): # put into a function for later comparison
    return pd_df.groupby('Name').apply(custom_agg, include_groups=False).droplevel(1)

naive_pandas()
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

Let's get rewriting! Remember that we are aiming to do the following:

  1. Avoid .groupby(…).apply()

  2. Avoid most user-defined functions

  3. Precompute any expression that is applied equally across all groups

  4. Make specific use of the dedicated .groupby verbs: .agg & .transform

def refactored_pandas():
    return (
        pd_df
        .assign( # pre-compute any computation used equally across all groups
            _desc_length=lambda d: d['Description'].str.len(),
        )
        .groupby('Name').agg(
            Source=('Source', lambda g: ', '.join(sorted(g))), # can't avoid a UDF here
            longest_desc_location=('_desc_length', 'idxmax')   # avoided a UDF
        )
        .merge( # fetches the "Description" and "Value" where we observed the longest description
            pd_df.drop(columns=["Name", "Source"]),
            left_on="longest_desc_location",
            right_index=True
        )
        .drop(columns=['longest_desc_location']) # Remove intermediate/temp columns
    )

refactored_pandas()
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

This version breaks the problem into orthogonal pieces: compute what you can before grouping, use .agg for the rest, and defer the multi-column logic to a final merge.
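
If the merge step feels indirect, the same multi-column lookup can also be resolved with .loc on the idxmax results. Below is a sketch of that equivalent arrangement (my own illustration, not one of the original answers), reusing pd_df from above:

# a sketch: resolve the lookup with .loc instead of a merge
lengths = pd_df['Description'].str.len()               # precomputed equally across all groups
idx = lengths.groupby(pd_df['Name']).idxmax()          # row label of the longest Description per Name
sources = pd_df.groupby('Name')['Source'].agg(lambda g: ', '.join(sorted(g)))
result = (
    pd_df.loc[idx, ['Name', 'Description', 'Value']]   # pull the winning rows
    .set_index('Name')
    .assign(Source=sources)                            # aligns on the Name index
    [['Source', 'Description', 'Value']]               # match the column order of our refactor
)

Either way, let's take a look and see how the performance of our refactored approach compares to the original.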

%timeit -n5 -r5 naive_pandas()
%timeit -n5 -r5 refactored_pandas()
3.24 ms ± 518 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
4.78 ms ± 235 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)

The timings are just about the same, with a slight advantage (as of the time of writing) towards the naive approach! Remember when I said that refactoring your .groupby operations doesn’t really matter for small datasets with low cardinality? Well, this is our exact circumstance.

Let's see if we can increase the cardinality and size of our data set to see if these timings begin to diverge.

More Data, Higher Cardinality

from random import Random
from numpy import unique
from numpy.random import default_rng
from pandas import DataFrame, Series
from string import ascii_uppercase

rnd = Random(0)
rng = default_rng(0)
categories = unique(
    rng.choice([*ascii_uppercase], size=(10_000, length := 4), replace=True)
    .view(f'<U{length}')
)

pd_df = DataFrame({
    'Name'   : categories.repeat(reps := 100),
    'Source' : rng.choice([*ascii_uppercase], size=(len(categories) * reps, 2)).view('<U2').ravel(),
    'Description' : [
        "".join(rnd.choices(ascii_uppercase, k=rnd.randrange(0, 10)))
        for _ in range(len(categories) * reps)
    ],
    'Value' : rng.integers(0, 1_000_000, size=(reps * categories.size)),
}).sample(frac=1, replace=True).reset_index(drop=True)

print(f'{len(pd_df) = :,} rows')
pd_df.head()
len(pd_df) = 987,400 rows
Name Source Description Value
0 EGMS UG JBNVIILF 985454
1 ORRL CV WZV 214651
2 JVCH KA 304578
3 DOVY ZO RMCIOJV 462253
4 JVGK JO GTFJ 678755

pd_df.apply(lambda g: g.nunique()) # The number of unique values in each column
Name             9874
Source            676
Description    387872
Value          463670
dtype: int64

Now we have closer to 1 million rows with ~10k unique groups based on the "Name" column. Let's rerun our timings to see if anything changes:

%timeit -n1 -r1 naive_pandas()
%timeit -n1 -r1 refactored_pandas()
5.82 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Now our refactored approach shows a more than 5x speed-up compared to the naive .groupby(…).apply(…) approach, taking the computation from nearly 6 seconds down to about 1 second! That's no small difference, especially considering that I would expect this gap to continue to widen for even larger datasets with higher cardinality.

More Than 1 Solution

With the above said, as you use pandas more, you'll often find that there are many ways to structure a .groupby computation while still respecting the aforementioned refactoring considerations. Another answerer on Stack Overflow submitted the solution below, which I found to be highly readable and which follows all of our refactoring rules! Its performance is very similar to what we wrote, trading our merge operation for a pre-sort of the data.

%%timeit -n1 -r1

(    
    pd_df
    .sort_values( # sorting is typically expensive, but now we don't need to self join/merge
        "Description", key=lambda g: g.str.len(), ascending=False, kind="mergesort"
    )
    .groupby("Name")
    .agg(
        Source=('Source', lambda g: ', '.join(sorted(g))),
        Description=('Description', 'first'),
        Value=('Value', 'first'),
    )
)
1.35 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
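
As a quick sanity check (my own addition, not part of the original answer), we can confirm that this arrangement and our merge-based refactor produce the same result. The stable "mergesort" is what makes this safe: ties in description length keep their original row order, so 'first' selects the same row that 'idxmax' does.

sorted_first = (
    pd_df
    .sort_values("Description", key=lambda g: g.str.len(), ascending=False, kind="mergesort")
    .groupby("Name")
    .agg(
        Source=('Source', lambda g: ', '.join(sorted(g))),
        Description=('Description', 'first'),
        Value=('Value', 'first'),
    )
)
assert sorted_first.equals(refactored_pandas())  # should hold: same rows chosen, same column order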

Wrap Up

Problems like this—where you need to select rows based on custom criteria during a groupby—are great reminders of pandas’ flexibility and its traps. It’s tempting to reach for .apply() because it feels so direct, but as datasets grow, performance hits start to matter. The good news is that many of these cases can be refactored into faster, more idiomatic pandas code.

Next time you find yourself writing a UDF inside a groupby, take a step back and ask: Can this be done without passing UDFs? Chances are, the answer is yes—and your future self (and your runtime) will thank you.

Have some pandas code that you want to refactor? Share it with us on the DUTC Discord server.
