Tables: Window Functions vs Group By#
Hello, everyone! This week, I want to dive into “window” and “group by.” What’s the difference? When should you use one over the other? Let’s take a look.
Theoretical Comparison#
Both window and group by functions perform operations across subsets of rows in a table. These subsets are formed by grouping rows on the unique values within a column.
For example, the following table has 3 columns: 'sensor', 'date', and 'temperature'.
These represent temperatures collected by sensors (presumably
in different locations) across multiple days.
import pandas as pd
import pyarrow as pa
from numpy.random import default_rng

rng = default_rng(8)

# build a 2-sensor × 10-day grid, draw random temperatures,
# then sample 10 rows so the groups end up unevenly sized
df = (
    pd.MultiIndex.from_product(
        iterables=[
            ['a', 'b'],
            pd.date_range('2000-01-01', periods=10, tz='US/Pacific')
        ],
        names=['sensor', 'date']
    )
    .to_frame(index=False)
    .assign(
        temperature=lambda d: rng.normal(70, scale=8, size=len(d)).astype(int)
    )
    .sample(n=10, random_state=rng)
    .sort_values(by=['sensor', 'date'])
    .reset_index(drop=True)
    .astype({
        'date': pd.ArrowDtype(pa.date64()),
    })
)
df
|   | sensor | date       | temperature |
|---|--------|------------|-------------|
| 0 | a      | 2000-01-01 | 56          |
| 1 | a      | 2000-01-03 | 59          |
| 2 | a      | 2000-01-06 | 68          |
| 3 | a      | 2000-01-07 | 62          |
| 4 | b      | 2000-01-01 | 76          |
| 5 | b      | 2000-01-03 | 76          |
| 6 | b      | 2000-01-05 | 64          |
| 7 | b      | 2000-01-07 | 69          |
| 8 | b      | 2000-01-08 | 81          |
| 9 | b      | 2000-01-10 | 67          |
Examining the table above, we can conclude that:

- The total DataFrame has N rows (10 rows)
- There exist G groups of unique sensors (2 unique sensors)
- G/N represents the DataFrame in its entirety, denoting N rows with G unique groups.
I’ll be using this N and G syntax to explain how these transformations determine the shape of the resultant table.
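As a quick sanity check, here is a small Python sketch that computes N and G directly from the df built above:

N = len(df)                 # total number of rows → 10
G = df['sensor'].nunique()  # number of unique groups → 2
print(f'N={N}, G={G}')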
For these conceptual sections, I’ll be using SQL via DuckDB to showcase the operations before diving into both pandas and Polars.
Grouped Aggregations#
G/N → G
Grouped aggregations take our G/N table and output a table of shape G. This means that, after our operation, we will have as many rows in the result as we do unique groups (i.e., one row per unique group).
import duckdb

duckdb.sql('''
    select
        sensor,
        avg(temperature) as avg_temp
    from df
    group by sensor
    order by sensor
''')
┌─────────┬───────────────────┐
│ sensor │ avg_temp │
│ varchar │ double │
├─────────┼───────────────────┤
│ a │ 61.25 │
│ b │ 72.16666666666667 │
└─────────┴───────────────────┘
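For comparison, here is a pandas sketch of the same G/N → G aggregation, a small preview of next week's tools, using named aggregation on the df from above:

(
    df
    .groupby('sensor', as_index=False)      # one output row per unique sensor
    .agg(avg_temp=('temperature', 'mean'))  # reduce N rows down to G
)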
Window Functions (Partitioned Aggregations)#
G/N → G/N
Window functions are designed to take a group of rows and output a result that is the same shape as its input. In this instance, let’s take the aggregated average temperature from each grouping and map it back to our original DataFrame. This is useful if we want to broadcast some aggregate operation from a group back to each individual row.
For example, we can calculate the temperature deviation: “How far away is each observed temperature reading from the average temperature for that group?”
One (subpar) way to organize this query is to perform a group by aggregation and join the aggregated table back to the original on the grouping column.
# Bad approach
duckdb.sql('''
    with agg as (
        select
            sensor,
            round(avg(temperature), 2) as temp_avg
        from df
        group by sensor
    )
    select
        df.*,
        agg.temp_avg,
        round(df.temperature - agg.temp_avg, 2) as temp_deviation,
    from df
    join agg on df.sensor = agg.sensor
    order by df.sensor
''')
┌─────────┬────────────┬─────────────┬──────────┬────────────────┐
│ sensor │ date │ temperature │ temp_avg │ temp_deviation │
│ varchar │ date │ int64 │ double │ double │
├─────────┼────────────┼─────────────┼──────────┼────────────────┤
│ a │ 2000-01-01 │ 56 │ 61.25 │ -5.25 │
│ a │ 2000-01-03 │ 59 │ 61.25 │ -2.25 │
│ a │ 2000-01-06 │ 68 │ 61.25 │ 6.75 │
│ a │ 2000-01-07 │ 62 │ 61.25 │ 0.75 │
│ b │ 2000-01-01 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-03 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-05 │ 64 │ 72.17 │ -8.17 │
│ b │ 2000-01-07 │ 69 │ 72.17 │ -3.17 │
│ b │ 2000-01-08 │ 81 │ 72.17 │ 8.83 │
│ b │ 2000-01-10 │ 67 │ 72.17 │ -5.17 │
├─────────┴────────────┴─────────────┴──────────┴────────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────┘
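For reference, the same “aggregate, then join back” pattern in pandas would look something like the sketch below, where merge plays the role of the SQL join:

agg = (
    df
    .groupby('sensor', as_index=False)
    .agg(temp_avg=('temperature', 'mean'))
    .round({'temp_avg': 2})
)
(
    df
    .merge(agg, on='sensor')  # rejoin the aggregate onto the original rows
    .assign(temp_deviation=lambda d: (d['temperature'] - d['temp_avg']).round(2))
)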
While we manage to obtain the desired result, the approach is a bit clunky: it requires
separating the aggregate operation from the rejoining. Instead, SQL provides
a more concise syntax that also lends itself to better query optimization. For window functions,
we can leave the group by clause behind, instead using an over (partition by …)
clause inside of the column selection. This makes sense because the result of
the window function will be the same size as all of the other columns, making
them directly comparable with one another.
# Good approach
duckdb.sql('''
    select
        *,
        round(
            avg(temperature) over (partition by sensor), 2
        ) as temp_avg,
        round(temperature - temp_avg, 2) as temp_deviation,
    from df
    order by sensor, date
''')
┌─────────┬────────────┬─────────────┬──────────┬────────────────┐
│ sensor │ date │ temperature │ temp_avg │ temp_deviation │
│ varchar │ date │ int64 │ double │ double │
├─────────┼────────────┼─────────────┼──────────┼────────────────┤
│ a │ 2000-01-01 │ 56 │ 61.25 │ -5.25 │
│ a │ 2000-01-03 │ 59 │ 61.25 │ -2.25 │
│ a │ 2000-01-06 │ 68 │ 61.25 │ 6.75 │
│ a │ 2000-01-07 │ 62 │ 61.25 │ 0.75 │
│ b │ 2000-01-01 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-03 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-05 │ 64 │ 72.17 │ -8.17 │
│ b │ 2000-01-07 │ 69 │ 72.17 │ -3.17 │
│ b │ 2000-01-08 │ 81 │ 72.17 │ 8.83 │
│ b │ 2000-01-10 │ 67 │ 72.17 │ -5.17 │
├─────────┴────────────┴─────────────┴──────────┴────────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────┘
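The window-function equivalent in pandas is groupby(…).transform, which broadcasts a group-level aggregate back to every row. Here is a sketch, again previewing next week's tools:

(
    df
    .assign(
        # broadcast the per-sensor mean back onto every row
        temp_avg=lambda d: d.groupby('sensor')['temperature'].transform('mean').round(2),
        temp_deviation=lambda d: (d['temperature'] - d['temp_avg']).round(2),
    )
)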
Window Functions (Running)#
While this is not a formal distinction, I previously mentioned that we can supply an aggregation to our window operation. However, since we have access to all of the values within a partition (grouping), we are not limited to simple aggregate functions.
By specifying an order by clause within … over (partition by … order by …),
we can alter the window of rows that SQL examines before calculating our specified
aggregation. In this case, we can perform cumulative (or running) aggregations.
If we want to query the running maximum temperature encountered within each group,
we can use the following query:
duckdb.sql('''
    select
        *,
        max(temperature) over (partition by sensor)
            as temp_sensor_max,
        max(temperature) over (partition by sensor order by date)
            as temp_sensor_max_running,
    from df
    order by sensor, date
''')
┌─────────┬────────────┬─────────────┬─────────────────┬─────────────────────────┐
│ sensor │ date │ temperature │ temp_sensor_max │ temp_sensor_max_running │
│ varchar │ date │ int64 │ int64 │ int64 │
├─────────┼────────────┼─────────────┼─────────────────┼─────────────────────────┤
│ a │ 2000-01-01 │ 56 │ 68 │ 56 │
│ a │ 2000-01-03 │ 59 │ 68 │ 59 │
│ a │ 2000-01-06 │ 68 │ 68 │ 68 │
│ a │ 2000-01-07 │ 62 │ 68 │ 68 │
│ b │ 2000-01-01 │ 76 │ 81 │ 76 │
│ b │ 2000-01-03 │ 76 │ 81 │ 76 │
│ b │ 2000-01-05 │ 64 │ 81 │ 76 │
│ b │ 2000-01-07 │ 69 │ 81 │ 76 │
│ b │ 2000-01-08 │ 81 │ 81 │ 81 │
│ b │ 2000-01-10 │ 67 │ 81 │ 81 │
├─────────┴────────────┴─────────────┴─────────────────┴─────────────────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────────────────────┘
Note that the passed aggregation for both selections was max(temperature). The
difference here is the presence of the order by within our over clause. With an
order by, the window function is evaluated over an expanding frame: first the
first row of a partition, then the first two rows, then the first three rows,
and so on, until we have consumed all rows in the partition. (This is SQL’s
default frame of range between unbounded preceding and current row.)
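To make that frame explicit, we could spell it out ourselves. Here is a sketch; the rows variant matches the default range frame in this case because each sensor’s dates are unique:

duckdb.sql('''
    select
        *,
        max(temperature) over (
            partition by sensor
            order by date
            rows between unbounded preceding and current row
        ) as temp_sensor_max_running
    from df
    order by sensor, date
''')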
You might recognize this pattern as a “cumulative maximum,” and you can
think of all specifications of over (… order by …) as cumulative operations.
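In pandas, this cumulative flavor has a direct analogue. A sketch, assuming df is already sorted by ['sensor', 'date'] as above:

(
    df
    .assign(
        # overall max per sensor, broadcast to every row
        temp_sensor_max=lambda d: d.groupby('sensor')['temperature'].transform('max'),
        # expanding (running) max per sensor, row by row
        temp_sensor_max_running=lambda d: d.groupby('sensor')['temperature'].cummax(),
    )
)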
Wrap-Up#
That’s all the time that we have today! Next week, we’ll dive into how we express these same concepts using our Python DataFrame tools: pandas & Polars. In the meantime, make sure to join our Discord to talk more about window operations!