Group By Patterns in Polars
Hello, everyone, and welcome back to Cameron's Corner! Today I want to continue last week's discussion of groupby operations in pandas, where we focused on refactoring existing .groupby(…).apply(…) queries. That discussion stemmed from a Stack Overflow question in which we needed to perform complex computations within a grouped context.
Rules for Scaling pandas Groupby
We established a few rules for writing pandas groupby operations that scale well:
1. Favor specific .groupby verbs: prefer .agg and .transform over .apply.
   Yes, .apply has fast paths that can tap into either .agg or .transform, but knowing when you'll hit these fast paths and when you won't can be tricky. Following this rule also prevents you from falling afoul of rule two.
2. Avoid accessing multiple columns within a single .groupby operation.
   If you're using .agg or .transform for your groupby operation, you can't violate this rule!
3. Avoid UDFs (user-defined functions). These present an optimization barrier, because pandas has no idea what you are attempting to do inside your operation.
These rules all aim to make the underlying computation within the groupby context as simple as possible. Any time we can pull a computation out of the grouped context so that it occurs either before or after the grouped operation, we should expect our execution speed to improve.
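As a minimal sketch of rule one, here's a toy frame of my own (not the data below) showing that the specific verb and the UDF spelling compute the same numbers, while only the former tells pandas exactly what we want:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Rule 1: the specific verb. pandas knows what 'mean' is and can use a
# fast, vectorized per-group implementation.
via_transform = df.groupby('group')['value'].transform('mean')

# The .apply spelling computes the same numbers, but pandas must call our
# UDF group by group and infer the output shape afterwards.
per_group = df.groupby('group')['value'].apply(lambda s: s.mean())
via_apply = df['group'].map(per_group)

assert via_transform.tolist() == via_apply.tolist() == [1.5, 1.5, 3.5, 3.5]
```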
Let's observe how these rules apply to the following data:
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(msg):
    start = perf_counter()
    yield
    stop = perf_counter()
    print(f'{msg:<35}{stop - start:.3f}s elapsed')
from random import Random
from numpy import unique
from numpy.random import default_rng
from pandas import DataFrame, Series
from string import ascii_uppercase
rnd = Random(0)
rng = default_rng(0)
categories = unique(
    rng.choice([*ascii_uppercase], size=(10_000, length := 4), replace=True)
    .view(f'<U{length}')
)

pd_df = (
    DataFrame({
        'Name': categories.repeat(reps := 100),
        'Source': (
            rng.choice([*ascii_uppercase], size=(len(categories) * reps, 2))
            .view('<U2').ravel()
        ),
        'Description': [
            "".join(rnd.choices(ascii_uppercase, k=rnd.randrange(5, 20)))
            for _ in range(len(categories) * reps)
        ],
        'Value': rng.integers(0, 1_000_000, size=(reps * categories.size)),
    })
    .astype({
        'Name': 'category',
        'Source': 'string[pyarrow]',
        'Description': 'string[pyarrow]',
    })
    .sample(frac=1, replace=True, random_state=rng)
    .reset_index(drop=True)
)
print(f'{pd_df["Name"].nunique()} unique Names across {len(pd_df)} rows')
pd_df.head()
9874 unique Names across 987400 rows
|   | Name | Source | Description | Value |
|---|---|---|---|---|
| 0 | PZJU | VG | ZCFNOZEAYPFNGWYZXLU | 742935 |
| 1 | JEMR | OA | KMJJDY | 197368 |
| 2 | LJAQ | IX | VVEQO | 412165 |
| 3 | OZNR | SM | TNSCRXBFYARAHU | 918452 |
| 4 | DFWL | RZ | XZHAIIFPATRBG | 794740 |
def custom_agg(group):
    longest_desc_df = (
        group.assign(_strlen=group['Description'].str.len())
        .nlargest(1, '_strlen')
    )
    return DataFrame({
        'Source': ', '.join(group['Source'].sort_values()),
        'Description': longest_desc_df['Description'],
        'Value': longest_desc_df['Value'],
    })
with timed('pandas groupby naive'):
    res_pd_naive = (
        pd_df.groupby('Name', observed=True)
        .apply(custom_agg, include_groups=False)
        .droplevel(1)
    )
pandas groupby naive 22.794s elapsed
with timed('pandas groupby refactored'):
    res_pd_refactored = (
        pd_df
        .assign(  # pre-compute any computation used identically across all groups
            _desc_length=lambda d: d['Description'].str.len(),
        )
        .groupby('Name', observed=True).agg(
            Source=('Source', lambda g: ', '.join(sorted(g))),  # can't avoid a UDF here
            longest_desc_location=('_desc_length', 'idxmax'),   # avoided a UDF
        )
        .merge(  # fetch the "Description" and "Value" where we observed the longest description
            pd_df.drop(columns=['Name', 'Source']),
            left_on='longest_desc_location',
            right_index=True,
        )
        .drop(columns=['longest_desc_location'])  # remove intermediate/temp columns
    )
pandas groupby refactored 1.416s elapsed
assert (res_pd_naive == res_pd_refactored).all(axis=None)
Polars: Refactoring Done For You
from polars import col, from_pandas
from polars import selectors as cs
pl_df = from_pandas(pd_df).lazy()
pl_query = (
    pl_df
    .group_by('Name').agg(
        col('Source').sort().str.join(', '),
        cs.exclude('Name', 'Source').get(
            col('Description').str.len_chars().arg_max()
        ),
    )
)
with timed('polars'):
    pl_result = pl_query.collect()

assert pl_result.sort('Name').equals(
    from_pandas(res_pd_refactored, include_index=True)
)
polars 0.059s elapsed
We all knew Polars is fast, and even with our "optimized" pandas code, we don't see execution speeds of this magnitude (Polars was roughly 24x faster than the refactored pandas version in the timings above). My favorite part about Polars is that we don't need to rewrite or refactor anything: most of the time, the Polars query optimizer will make those decisions for you to keep your query running as fast as possible!
Let's take a look at which optimizations Polars applied to our query. We'll need a small helper to make the query plans easier to read:
import re

def format_plan(query_str):
    """Re-format the text representation of a Polars query plan.
    Note: non-portable, quick and dirty. I wouldn't use this outside of a blog post.
    """
    def replacement(m):
        # put each comma-separated expression inside [...] on its own indented line
        parts = '\n    '.join(part.strip() for part in m.group(1).split(','))
        return f'[\n    {parts}\n]'
    return re.sub(r'\[(.*?)\]', replacement, query_str)
While Polars doesn't expose its internal query representation to the Python level, we can visually investigate the result of calling LazyFrame.explain() to read the query before/after it has passed through the query optimizer.
Take a look at the following two query plans: can you spot the difference?
print(
    format_plan(pl_query.explain(optimized=False))
)
AGGREGATE
[
    col("Source").sort(asc).str.concat_vertical()
    col("Description").get(col("Description").str.len_chars().arg_max())
    col("Value").get(col("Description").str.len_chars().arg_max())
] BY [
    col("Name")
] FROM
DF [
    "Name"
    "Source"
    "Description"
    "Value"
]; PROJECT */4 COLUMNS; SELECTION: None
print(
    format_plan(pl_query.explain())
)
AGGREGATE
[
    col("Source").sort(asc).str.concat_vertical()
    col("Description").get(col("__POLARS_CSER_0x443fe19597235056").arg_max()).alias("Description")
    col("Value").get(col("__POLARS_CSER_0x443fe19597235056").arg_max()).alias("Value")
] BY [
    col("Name")
] FROM
WITH_COLUMNS:
[
    col("Description").str.len_chars().alias("__POLARS_CSER_0x443fe19597235056")
]
DF [
    "Name"
    "Source"
    "Description"
    "Value"
]; PROJECT 4/4 COLUMNS; SELECTION: None
Comparing the two query plans, you can see that in the optimized plan Polars moves the expression col("Description").str.len_chars() out of the AGGREGATE context and into a WITH_COLUMNS step, computing it once instead of twice per group; this is exactly the optimization we made by hand in pandas! Since Polars has a query optimizer and its expressions are entirely transparent to it, Polars can restructure our query into the plan most likely to produce the desired result the fastest. Our pandas code can't do this; instead, we have to trust developers to write near-optimal code themselves.
So Polars is Just That Much Faster?
Well… yes and no. Polars is undoubtedly fast, but at the same time, I feel that pandas isn't as slow as its reputation makes it out to be. In fact, when I refactored our original pandas code, there was one operation I didn't pay much attention to: the one UDF we were unable to remove from the .groupby evaluation context, Source=('Source', lambda g: ', '.join(sorted(g))). As it turns out, if we remove this from our computation, our pandas code gets much closer to our Polars code in execution speed. I don't expect this change to beat the Polars implementation by any means, but I did want to highlight that pandas isn't slow; it just has some slow operations.
with timed('pandas groupby (no udf)'):
    res_pd_refactored = (
        pd_df
        .assign(  # pre-compute any computation used identically across all groups
            _desc_length=lambda d: d['Description'].str.len(),
        )
        .groupby('Name', observed=True).agg(
            # The Source computation is a pure UDF; what happens if we skip it?
            # Source=('Source', lambda g: ', '.join(sorted(g))),
            longest_desc_location=('_desc_length', 'idxmax'),  # avoided a UDF
        )
        .merge(  # fetch the "Description" and "Value" where we observed the longest description
            pd_df.drop(columns=['Name', 'Source']),
            left_on='longest_desc_location',
            right_index=True,
        )
        .drop(columns=['longest_desc_location'])  # remove intermediate/temp columns
    )
pandas groupby (no udf) 0.109s elapsed
Wrap-Up
Where pandas requires you to carefully extract computations from the grouped context and avoid performance traps like UDFs, Polars’ expression system and query optimizer enable it to handle much of that complexity behind the scenes.
The takeaway? pandas isn't inherently slow; it just demands a bit more care when writing performant code. Polars, on the other hand, gives you performance and peace of mind, especially for complex data transformations.
Next time you're reaching for groupby, ask yourself:
- Can I rewrite this to avoid .apply?
- Are there computations I can move outside the grouped context?
- Should I try this in Polars and see what it comes up with?
Thanks for reading! Talk to you all again next week, where we’ll dig into even more patterns for writing fast, expressive data code in Python.
What do you think about pandas vs. Polars? Let me know on the DUTC Discord server.