Group By Patterns in pandas

Welcome back to Cameron's Corner! This week, I wanted to write about a topic that came up in a question I recently came across on Stack Overflow. In it, the user wanted to group by a name column, aggregate a few fields, and choose one row per group based on the length of a string column.

While it seems simple on the surface, this example is a classic case where pandas’ groupby capabilities shine, but also where it’s easy to reach for .apply() and user-defined functions (UDFs) without considering the performance costs. In this post, we’ll walk through the original problem and demonstrate a pattern for refactoring groupby + apply into something more efficient and scalable, which is especially useful when you’re working with larger datasets. Along the way, we’ll explore where and why performance matters and show how to restructure computations so they align better with pandas’ built-in methods.

The Question

from pandas import DataFrame

pd_df = DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6]
})

pd_df
Name Source Description Value
0 John A Text1 1
1 John B Longer text 4
2 Bob B Text2 2
3 Alice Z Longer text 5
4 Alice Y The Longest text 3
5 Alice X Text3 6

Our goal is to group by the Name column and...

  1. Concatenate the Source values in sorted order.

  2. For each group, pick the Description and Value from the row with the longest description.

We should expect a result like:

DataFrame({
    'Name': ['Alice', 'Bob', 'John'],
    'Source': ['X, Y, Z', 'B', 'A, B'],
    'Description': ['The Longest text', 'Text2', 'Longer text'],
    'Value': [3, 2, 4],
})
Name Source Description Value
0 Alice X, Y, Z The Longest text 3
1 Bob B Text2 2
2 John A, B Longer text 4

Within each grouping of Name,

  1. sort & join the "Source" column

  2. take the "Value" and "Description" from the row with the longest individual description

Naive Solutions

Let’s first write this in the most straightforward way, using .groupby.apply().

def custom_agg(group):
    longest_desc_df = (
        group.assign(_strlen=group['Description'].str.len())
        .nlargest(1, '_strlen')
    )

    return DataFrame({
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': longest_desc_df['Description'],
        'Value': longest_desc_df['Value']
    })

pd_df.groupby('Name').apply(custom_agg, include_groups=False).droplevel(1)
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

This works and is intuitive, but there’s a cost: groupby.apply() with UDFs often struggles to scale. Let’s examine what this means at scale.

Scaling Groupby Operations

Here, I'll simulate a larger dataset and compare four common ways of aggregating. Below, I set up some code for microbenchmarking as well as the dataset we'll use. We should have ~10,000 unique groups spread across ~1,000,000 rows, which gives us a moderate amount of cardinality and data to process.

from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(msg):
    start = perf_counter()
    yield
    stop = perf_counter()
    print(f'{msg:<25}{stop - start:.6f}s elapsed')

from numpy import unique
from numpy.random import default_rng
from pandas import DataFrame, Series
from string import ascii_uppercase

rng = default_rng(0)
categories = unique(
    rng.choice([*ascii_uppercase], size=(10_000, length := 4), replace=True)
    .view(f'<U{length}')
)

df = DataFrame({
    'groups' : categories.repeat(reps := 100),
    'values' : rng.normal(100, 10, size=(reps * categories.size)),
}).sample(frac=1, replace=True).reset_index(drop=True)

print(f'{df["groups"].nunique()} unique categories across {df.shape[0]} rows')
df.head()
9874 unique categories across 987400 rows
groups values
0 AWZN 85.156486
1 LMAL 109.013665
2 HKKC 81.790056
3 NXMY 116.955168
4 XLWE 107.381923

Let's compare the simplest operation I can think of: grouped averages. This is easy to implement, but you can perform this operation in many different ways. I'll compare some of the most common ways of writing this computation that I have seen:

  • .groupby.mean()

  • .groupby.agg("mean")

  • .groupby.apply(lambda x: x.mean())

  • Manual for-loop & concatenation

grouped = df.groupby('groups')
grouped.groups # build the group mapping up front so it isn't included in the timings below

with timed('① groupby.mean()'):
    res1 = grouped['values'].mean()

with timed('② groupby.agg("mean")'):
    res2 = grouped['values'].agg("mean")

with timed('③ groupby.apply(lambda …)'):
    res3 = grouped['values'].apply(lambda x: x.mean())

with timed('④ groupby for-loop'):
    parts = {}
    for name, group in grouped['values']:
        parts[name] = group.mean()
    res4 = Series(parts)

# verify all results produce the same values
from itertools import pairwise
from numpy import allclose
assert all(allclose(x, y) for x, y in pairwise([res1, res2, res3, res4]))
① groupby.mean()         0.005905s elapsed
② groupby.agg("mean")    0.005266s elapsed
③ groupby.apply(lambda …)0.401816s elapsed
④ groupby for-loop       0.391656s elapsed

As you can see, approaches 1 & 2 perform similarly, while approaches 3 & 4 introduce a massive (50x-100x) slowdown; note that .groupby(…).apply(UDF) is no faster than a plain Python for-loop.
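
Why is the gap so large? Conceptually, .groupby(…).apply(UDF) boils down to a Python-level loop that calls your function once per group and then glues the per-group results back together, whereas .mean() and .agg("mean") dispatch to optimized, vectorized code paths. As a rough mental model (a simplified sketch, not pandas' actual implementation; sketch_groupby_apply is a made-up name, and it assumes the UDF returns a pandas object):

from pandas import concat

def sketch_groupby_apply(df, by, udf):
    # one Python-level call of the UDF per group…
    pieces = {}
    for key, group in df.groupby(by):
        pieces[key] = udf(group)
    # …then the per-group results get stitched back together;
    # the group keys become the outer index level
    return concat(pieces)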

Clearly, there is some large benefit to using built-in pandas aggregations in terms of performance. But the answer to the original question from Stack Overflow wasn’t a simple or obvious .mean() operation. It required many different operations strung together. So how do we refactor our existing approach? Let's talk about it.

Refactoring pandas groupby

First of all, if you are working with a small dataset that has a small number of unique groups (e.g., hundreds of groups and thousands of rows), then refactoring isn't necessary. The refactoring approach discussed below favors optimizations that are most beneficial when working with larger datasets. Of course, we're still well below the threshold of "big" data, which is the point at which one would abandon pandas altogether. As a reminder, here is the naive solution we'll be refactoring:

def custom_agg(group):
    longest_desc_df = (
        group.assign(_strlen=group['Description'].str.len())
        .nlargest(1, '_strlen')
    )

    return DataFrame({
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': longest_desc_df['Description'],
        'Value': longest_desc_df['Value']
    })

pd_df.groupby('Name').apply(custom_agg, include_groups=False).droplevel(1)
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

Here's the trick:

  • Avoid user-defined functions (UDFs; like our custom_agg function) wherever possible

  • Use the more specific .agg or .transform methods over .apply (a short illustration of the difference follows this list)

    • Reach for .apply only if you need to access multiple columns within the grouped computation, noting that this has a negative impact on overall performance.
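
To make the .agg/.transform distinction concrete, here is a minimal illustration using the pd_df from above: .agg collapses each group down to one row, while .transform returns a result aligned to the original rows.

pd_df.groupby('Name')['Value'].agg('max')        # one value per Name (3 rows)
pd_df.groupby('Name')['Value'].transform('max')  # each group's max, broadcast back to all 6 rows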

So how do you remove the reliance on user-defined functions? There are two types of computations to keep an eye out for when refactoring. You'll want to refactor any computation that…

  1. Is applied equally across all groups (e.g., computing the length of the strings in the "Description" column)

  2. Needs to access multiple columns within the same group (e.g., locating the "Value" that corresponds to the longest "Description")

Let's review our solution to see if we can spot any areas that can be refactored.

def custom_agg(group):
    longest_desc_loc = (
        group['Description'].str.len() # applied equally regardless of the group
        .idxmax()                      # Series.groupby(…).idxmax(…)
    )

    return DataFrame({ 
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': group.loc[longest_desc_loc, 'Description'],
        'Value': group.loc[longest_desc_loc, 'Value'],
    }, index=[0])

# we’re ultimately aggregating here, so use the most appropriate verb: `.agg`
# By using this verb we will also avoid accessing multiple columns within the
#   same grouped computation (`custom_agg` uses "Description", "Source", and "Value").
def naive_pandas(): # put into a function for later comparison
    return pd_df.groupby('Name').apply(custom_agg, include_groups=False).droplevel(1)

naive_pandas()
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

Let's get rewriting! Remember that we are aiming to do the following:

  1. Avoid .groupby(…).apply()

  2. Avoid most user-defined functions

  3. Precompute any expression that is applied equally across all groups

  4. Make specific use of the dedicated .groupby verbs: .agg & .transform

def refactored_pandas():
    return (
        pd_df
        .assign( # pre-compute any computation used equally across all groups
            _desc_length=lambda d: d['Description'].str.len(),
        )
        .groupby('Name').agg(
            Source=('Source', lambda g: ', '.join(sorted(g))), # can't avoid a UDF here
            longest_desc_location=('_desc_length', 'idxmax')   # avoided a UDF
        )
        .merge( # fetches the "Description" and "Value" where we observed the longest description
            pd_df.drop(columns=["Name", "Source"]),
            left_on="longest_desc_location",
            right_index=True
        )
        .drop(columns=['longest_desc_location']) # Remove intermediate/temp columns
    )

refactored_pandas()
Source Description Value
Name
Alice X, Y, Z The Longest text 3
Bob B Text2 2
John A, B Longer text 4

This version breaks the problem into orthogonal pieces: compute what you can before grouping, use .agg for the rest, and defer the multi-column logic to a final merge.
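
If the merge step feels indirect, the same multi-column lookup can also be resolved with .loc on the idxmax results. Below is a sketch of that equivalent arrangement (my own illustration, not one of the original answers), reusing pd_df from above:

# a sketch: resolve the lookup with .loc instead of a merge
lengths = pd_df['Description'].str.len()               # precomputed equally across all groups
idx = lengths.groupby(pd_df['Name']).idxmax()          # row label of the longest Description per Name
sources = pd_df.groupby('Name')['Source'].agg(lambda g: ', '.join(sorted(g)))
result = (
    pd_df.loc[idx, ['Name', 'Description', 'Value']]   # pull the winning rows
    .set_index('Name')
    .assign(Source=sources)                            # aligns on the Name index
    [['Source', 'Description', 'Value']]               # match the column order of our refactor
)

Either way, let's take a look and see how the performance of our refactored approach compares to the original.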

%timeit -n5 -r5 naive_pandas()
%timeit -n5 -r5 refactored_pandas()
3.24 ms ± 518 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
4.78 ms ± 235 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)

The timings are just about the same, with a slight advantage (as of the time of writing) towards the naive approach! Remember when I said that refactoring your .groupby operations doesn’t really matter for small datasets with low cardinality? Well, this is our exact circumstance.

Let's see if we can increase the cardinality and size of our data set to see if these timings begin to diverge.

More Data, Higher Cardinality

from random import Random
from numpy import unique
from numpy.random import default_rng
from pandas import DataFrame, Series
from string import ascii_uppercase

rnd = Random(0)
rng = default_rng(0)
categories = unique(
    rng.choice([*ascii_uppercase], size=(10_000, length := 4), replace=True)
    .view(f'<U{length}')
)

pd_df = DataFrame({
    'Name'   : categories.repeat(reps := 100),
    'Source' : rng.choice([*ascii_uppercase], size=(len(categories) * reps, 2)).view('<U2').ravel(),
    'Description' : [
        "".join(rnd.choices(ascii_uppercase, k=rnd.randrange(0, 10)))
        for _ in range(len(categories) * reps)
    ],
    'Value' : rng.integers(0, 1_000_000, size=(reps * categories.size)),
}).sample(frac=1, replace=True).reset_index(drop=True)

print(f'{len(pd_df) = :,} rows')
pd_df.head()
len(pd_df) = 987,400 rows
Name Source Description Value
0 EGMS UG JBNVIILF 985454
1 ORRL CV WZV 214651
2 JVCH KA 304578
3 DOVY ZO RMCIOJV 462253
4 JVGK JO GTFJ 678755

pd_df.apply(lambda g: g.nunique()) # The number of unique values in each column
Name             9874
Source            676
Description    387872
Value          463670
dtype: int64

Now we have closer to 1 million rows with ~10k unique groups based on the "Name" column. Let's rerun our timings to see if anything changes:

%timeit -n1 -r1 naive_pandas()
%timeit -n1 -r1 refactored_pandas()
5.82 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Now our refactored approach shows a more than 5x speed-up compared to the naive .groupby(…).apply(…) approach, taking the computation from nearly 6 seconds down to about 1 second! That's no small difference, especially considering that I would expect this gap to continue to widen for even larger datasets with higher cardinality.

More Than 1 Solution

With the above said, as you use pandas more, you'll often find that there are many ways to structure a .groupby computation while still respecting the aforementioned refactoring considerations. Another answerer on Stack Overflow submitted the solution below, which I found to be highly readable and which follows all of our refactoring rules! Its performance is very similar to what we wrote, trading our merge operation for a pre-sort of the data.

%%timeit -n1 -r1

(    
    pd_df
    .sort_values( # sorting is typically expensive, but now we don't need to self join/merge
        "Description", key=lambda g: g.str.len(), ascending=False, kind="mergesort"
    )
    .groupby("Name")
    .agg(
        Source=('Source', lambda g: ', '.join(sorted(g))),
        Description=('Description', 'first'),
        Value=('Value', 'first'),
    )
)
1.35 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
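
As a quick sanity check (my own addition, not part of the original answer), we can confirm that this arrangement and our merge-based refactor produce the same result. The stable "mergesort" is what makes this safe: ties in description length keep their original row order, so 'first' selects the same row that 'idxmax' does.

sorted_first = (
    pd_df
    .sort_values("Description", key=lambda g: g.str.len(), ascending=False, kind="mergesort")
    .groupby("Name")
    .agg(
        Source=('Source', lambda g: ', '.join(sorted(g))),
        Description=('Description', 'first'),
        Value=('Value', 'first'),
    )
)
assert sorted_first.equals(refactored_pandas())  # should hold: same rows chosen, same column order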

Wrap Up

Problems like this—where you need to select rows based on custom criteria during a groupby—are great reminders of pandas’ flexibility and its traps. It’s tempting to reach for .apply() because it feels so direct, but as datasets grow, performance hits start to matter. The good news is that many of these cases can be refactored into faster, more idiomatic pandas code.

Next time you find yourself writing a UDF inside a groupby, take a step back and ask: Can this be done without passing UDFs? Chances are, the answer is yes—and your future self (and your runtime) will thank you.

Have some pandas code that you want to refactor? Share it with us on the DUTC Discord server.
