pandas & Polars: Window Functions vs Group By#

Welcome to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

This week, I want to dive back into “window” and “group by” operations. This time, instead of focusing on the SQL syntax, we’ll cover my two favorite DataFrame libraries, pandas and Polars, to discuss the differences in their APIs.

Theoretical Comparison#

Both “window” and “group by” operations perform computations across subsets of a table’s rows, where each subset is determined by the unique values in a grouping column.

For example, the following table has 3 columns: 'sensor', 'date', and 'temperature'. These represent different temperatures collected by different sensors (presumably in different locations) across different days.

import pandas as pd
import polars as pl
from polars import col
import pyarrow as pa
from numpy.random import default_rng
rng = default_rng(8)

pd_df = (
    pd.MultiIndex.from_product(
        iterables=[['a', 'b'], pd.date_range('2000-01-01', periods=10, tz='US/Pacific')],
        names=['sensor', 'date']
    )
    .to_frame(index=False)
    .assign(
        temperature=lambda d: rng.normal(70, scale=8, size=len(d)).astype(int)
    )
    .sample(n=10, random_state=rng)
    .sort_values(by=['sensor', 'date'])
    .reset_index(drop=True)
    .astype({
        'date': pd.ArrowDtype(pa.date64()),
    })
)

pl_df = pl.from_pandas(pd_df).cast({'date': pl.Date()})

display_grid(1, 2, pandas=pd_df, polars=pl_df)

Pandas

  sensor        date  temperature
0      a  2000-01-01           56
1      a  2000-01-03           59
2      a  2000-01-06           68
3      a  2000-01-07           62
4      b  2000-01-01           76
5      b  2000-01-03           76
6      b  2000-01-05           64
7      b  2000-01-07           69
8      b  2000-01-08           81
9      b  2000-01-10           67

Polars

shape: (10, 3)
┌────────┬────────────┬─────────────┐
│ sensor ┆ date       ┆ temperature │
│ ---    ┆ ---        ┆ ---         │
│ str    ┆ date       ┆ i64         │
╞════════╪════════════╪═════════════╡
│ "a"    ┆ 2000-01-01 ┆ 56          │
│ "a"    ┆ 2000-01-03 ┆ 59          │
│ "a"    ┆ 2000-01-06 ┆ 68          │
│ "a"    ┆ 2000-01-07 ┆ 62          │
│ "b"    ┆ 2000-01-01 ┆ 76          │
│ "b"    ┆ 2000-01-03 ┆ 76          │
│ "b"    ┆ 2000-01-05 ┆ 64          │
│ "b"    ┆ 2000-01-07 ┆ 69          │
│ "b"    ┆ 2000-01-08 ┆ 81          │
│ "b"    ┆ 2000-01-10 ┆ 67          │
└────────┴────────────┴─────────────┘

Examining the table above, we can conclude that…

  • The DataFrame has N rows in total (here, 10 rows)

  • There are G unique groups of sensors (here, 2 unique sensors)

The notation N/G represents the DataFrame in its entirety: N rows spanning G unique groups.

I’ll be using this N & G notation to explain how these transformations determine the shape of the resulting table.

For these conceptual sections, I’ll be showing both pandas and Polars syntax to highlight where they are similar or different from one another.

Grouped Aggregations#

This transformation is N/G → G.

In both pandas and Polars, a “group by” operation followed by an “aggregating” function will ALWAYS return one row per group (the G in N/G → G). An aggregating function is one that takes an array of values and reduces it to a single element.

One of the large differences between these two interfaces is that pandas has at least four ways to spell these aggregations, depending on whether you want to apply the same operation across all of the input columns or rename the results. This is in stark contrast to Polars, which provides only one interface (two if you count renaming columns via keyword arguments instead of .alias). All four pandas spellings are sketched in code after the lists below.

pandas:

  • DataFrame.groupby(…)[columns].agg(numpy ufunc|user-defined function|str)

  • DataFrame.groupby(…)[columns].min() # or… .max(), .mean(), etc.

    • you can also directly call an aggregating method instead of passing it into the .agg interface.

  • DataFrame.groupby(…).agg({column: aggfunc, …})

    • to perform different aggregating functions per column, construct a dictionary mapping input columns to strings or functions.

  • DataFrame.groupby(…).agg(new_column=('column', aggfunc), …)

    • to rename a column upon aggregation while also performing different aggregating functions per column, pass keyword arguments whose values are tuples of (input-column, aggfunc).

Polars:

  • DataFrame.group_by(…).agg(expression-that-aggregates)
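
As promised, here is a minimal sketch of those four pandas spellings; each computes the same per-sensor maximum, and max_temp in the final form is just an illustrative name:

# the four pandas spellings listed above, all computing the per-sensor maximum;
# 'max_temp' is an illustrative name used by the named-aggregation form
pd_df.groupby('sensor')['temperature'].agg('max')
pd_df.groupby('sensor')['temperature'].max()
pd_df.groupby('sensor').agg({'temperature': 'max'})
pd_df.groupby('sensor').agg(max_temp=('temperature', 'max'))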

Let’s take a look at this in action:

display_grid(
    1, 2,
    #    group on 'sensor', take max of 'temperature'
    pandas=pd_df.groupby('sensor')['temperature'].agg('max').to_frame(),
    polars=pl_df.group_by('sensor').agg(col('temperature').max()),
)

Pandas

        temperature
sensor
a                68
b                81

Polars

shape: (2, 2)
┌────────┬─────────────┐
│ sensor ┆ temperature │
│ ---    ┆ ---         │
│ str    ┆ i64         │
╞════════╪═════════════╡
│ "b"    ┆ 81          │
│ "a"    ┆ 68          │
└────────┴─────────────┘

Both of these interfaces behave differently if the passed function/expression does NOT aggregate.

  • pandas will error out denoting ValueError: Must produce aggregated value.

  • Polars returns the resultant column as a list (effectively collecting all of the values from each group into a single list element).

Note: since the errors reported by these tools can become quite lengthy, I am going to use a snippet to only report the last portion of a traceback.

from traceback import print_exc
from contextlib import contextmanager

@contextmanager
def short_traceback(limit=1):
    try:
        yield
    except Exception:
        print_exc(limit=limit)

def first_2(group):
    return group.head(2)

with short_traceback():
    pd_df.groupby('sensor')['temperature'].agg(first_2).to_frame()
Traceback (most recent call last):
  File "/tmp/ipykernel_110555/332924374.py", line 7, in short_traceback
    yield
ValueError: Must produce aggregated value
pl_df.group_by('sensor').agg(col('temperature').pipe(first_2))
shape: (2, 2)
┌────────┬─────────────┐
│ sensor ┆ temperature │
│ ---    ┆ ---         │
│ str    ┆ list[i64]   │
╞════════╪═════════════╡
│ "b"    ┆ [76, 76]    │
│ "a"    ┆ [56, 59]    │
└────────┴─────────────┘

Windowed Transformation#

This transformation is N/G → N/G.

Window operations work within each group but return a structure whose size matches that of the original group. Polars and pandas have quite different APIs for this type of operation:

  • pandas: DataFrame.groupby(group).transform(func)

  • Polars: DataFrame.select(expression.over(group))

The Polars syntax closely mirrors the equivalent SQL, whereas the pandas syntax builds on the existing groupby interface.
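
To make that correspondence concrete, here is a rough sketch; the SQL is shown in comments purely for reference:

# the Polars window expression reads much like its SQL counterpart:
#   SELECT MAX(temperature) OVER (PARTITION BY sensor) FROM …
pl_df.select(col('temperature').max().over('sensor'))

# whereas pandas reaches the same result through its groupby machinery
pd_df.groupby('sensor')['temperature'].transform('max')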

Let’s see these two APIs in action by taking the cumulative maximum that occurs within each group.

display_grid(
    1, 2,
    pandas=(
        pd_df.set_index('sensor') # preserve sensor in our output
        # take the cumulative maximum within each group
        .groupby('sensor')[['temperature']].transform('cummax')
    ),
    polars=pl_df.select(
        col('sensor'),
        col('temperature').cum_max().over('sensor')
    ),
)

Pandas

        temperature
sensor
a                56
a                59
a                68
a                68
b                76
b                76
b                76
b                76
b                81
b                81

Polars

shape: (10, 2)
┌────────┬─────────────┐
│ sensor ┆ temperature │
│ ---    ┆ ---         │
│ str    ┆ i64         │
╞════════╪═════════════╡
│ "a"    ┆ 56          │
│ "a"    ┆ 59          │
│ "a"    ┆ 68          │
│ "a"    ┆ 68          │
│ "b"    ┆ 76          │
│ "b"    ┆ 76          │
│ "b"    ┆ 76          │
│ "b"    ┆ 76          │
│ "b"    ┆ 81          │
│ "b"    ┆ 81          │
└────────┴─────────────┘

But what happens if we try to return a misshapen result from these two interfaces? Much like the aggregating case, where we forced pandas to throw an error, can we provoke a failure in a window operation by returning a result whose size does not match that of the original group?

def first_2(group):
    return group.head(2)

pd_df.groupby('sensor')['temperature'].transform(first_2)
0    56.0
1    59.0
2     NaN
3     NaN
4    76.0
5    76.0
6     NaN
7     NaN
8     NaN
9     NaN
Name: temperature, dtype: float64
with short_traceback():
    pl_df.select(col('temperature').pipe(first_2).over('sensor'))
Traceback (most recent call last):
  File "/tmp/ipykernel_110555/332924374.py", line 7, in short_traceback
    yield
polars.exceptions.ComputeError: the length of the window expression did not match that of the group

Error originated in expression: 'col("temperature").slice(offset=0, length=2).over([col("sensor")])'

Now it looks like the roles have reversed: pandas fills in NaNs in the event we did not match the original size of the group, and Polars throws a ComputeError.

But why? Because of index alignment!

I wasn’t kidding when I said that all operations in pandas are index-aligned. The .transform(…) function is evaluated, and its result is index-aligned back to the original group within which it was evaluated.
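
As a standalone illustration of that alignment (a small sketch, separate from our grouped example):

# adding a 2-element Series to a 3-element Series aligns on the index labels
# and fills the unmatched label with NaN; this is the same mechanism that
# produces NaNs in a misshapen .transform result
pd.Series([1, 2, 3], index=['x', 'y', 'z']) + pd.Series([10, 20], index=['x', 'y'])
# x    11.0
# y    22.0
# z     NaN
# dtype: float64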

def first_2(group):
    return group.head(2).values # .values strips the index, so the result cannot be aligned back

with short_traceback():
    pd_df.groupby('sensor')['temperature'].transform(first_2)
Traceback (most recent call last):
  File "/tmp/ipykernel_110555/332924374.py", line 7, in short_traceback
    yield
ValueError: Length of values (2) does not match length of index (4)

And there we have agreement between these two tools: without an index to align on, a misshapen result is an error in both. Remember, if you’re working in pandas and see NaNs from an operation when you weren’t expecting them, then you’ve probably forgotten about index alignment. As a final example, an aggregating function also works in a windowed context; the single aggregated value is simply broadcast back to every row of its group:

display_grid(
    1, 2,
    pandas=(
        pd_df.set_index('sensor') # preserve sensor in our output
        # take the maximum within each group (broadcast back to every row)
        .groupby('sensor')[['temperature']].transform('max')
    ),
    polars=pl_df.select(
        col('sensor'),
        col('temperature').max().over('sensor')
    ),
)

Pandas

        temperature
sensor
a                68
a                68
a                68
a                68
b                81
b                81
b                81
b                81
b                81
b                81

Polars

shape: (10, 2)
┌────────┬─────────────┐
│ sensor ┆ temperature │
│ ---    ┆ ---         │
│ str    ┆ i64         │
╞════════╪═════════════╡
│ "a"    ┆ 68          │
│ "a"    ┆ 68          │
│ "a"    ┆ 68          │
│ "a"    ┆ 68          │
│ "b"    ┆ 81          │
│ "b"    ┆ 81          │
│ "b"    ┆ 81          │
│ "b"    ┆ 81          │
│ "b"    ┆ 81          │
│ "b"    ┆ 81          │
└────────┴─────────────┘

Wrap-Up#

And there we have it: grouped aggregations and window functions in both pandas and Polars.

What did you think about these approaches? Let me know on the DUTC Discord. Talk to you all next week!