Tables: Window Functions vs Group By#
Hello, everyone! This week, I want to dive into “window” and “group by.” What’s the difference? When should you use one over the other? Let’s take a look.
Theoretical Comparison#
Both window and group by functions perform operations across subsets of rows in a table. These subsets are formed by grouping rows on the unique values within a column.
For example, the following table has 3 columns: 'sensor', 'date', and 'temperature'.
These represent temperatures collected by sensors (presumably
in different locations) across multiple days.
import pandas as pd
import pyarrow as pa
from numpy.random import default_rng

rng = default_rng(8)

# build a 2-sensor × 10-day grid, draw random temperatures,
# then sample 10 rows so the groups end up unevenly sized
df = (
    pd.MultiIndex.from_product(
        iterables=[
            ['a', 'b'],
            pd.date_range('2000-01-01', periods=10, tz='US/Pacific')
        ],
        names=['sensor', 'date']
    )
    .to_frame(index=False)
    .assign(
        temperature=lambda d: rng.normal(70, scale=8, size=len(d)).astype(int)
    )
    .sample(n=10, random_state=rng)
    .sort_values(by=['sensor', 'date'])
    .reset_index(drop=True)
    .astype({
        'date': pd.ArrowDtype(pa.date64()),
    })
)
df
|   | sensor | date       | temperature |
|---|--------|------------|-------------|
| 0 | a      | 2000-01-01 | 56          |
| 1 | a      | 2000-01-03 | 59          |
| 2 | a      | 2000-01-06 | 68          |
| 3 | a      | 2000-01-07 | 62          |
| 4 | b      | 2000-01-01 | 76          |
| 5 | b      | 2000-01-03 | 76          |
| 6 | b      | 2000-01-05 | 64          |
| 7 | b      | 2000-01-07 | 69          |
| 8 | b      | 2000-01-08 | 81          |
| 9 | b      | 2000-01-10 | 67          |
Examining the table above, we can conclude that:

- The total DataFrame has N rows (10 rows)
- There exist G groups of unique sensors (2 unique sensors)
- G/N represents the DataFrame in its entirety, denoting N rows with G unique groups.
I’ll be using this N and G syntax to explain how these transformations determine the shape of the resultant table.
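As a quick sanity check, here is a small Python sketch that computes N and G directly from the df built above:

N = len(df)                 # total number of rows → 10
G = df['sensor'].nunique()  # number of unique groups → 2
print(f'N={N}, G={G}')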
For these conceptual sections, I’ll be using SQL via DuckDB to showcase the operations before diving into both pandas and Polars.
Grouped Aggregations#
G/N → G
Grouped aggregations take our G/N table and output a table of shape G. This means that, after our operation, we will have as many rows in the result as we do unique groups (i.e., one row per unique group).
import duckdb

duckdb.sql('''
    select
        sensor,
        avg(temperature) as avg_temp
    from df
    group by sensor
    order by sensor
''')
┌─────────┬───────────────────┐
│ sensor │ avg_temp │
│ varchar │ double │
├─────────┼───────────────────┤
│ a │ 61.25 │
│ b │ 72.16666666666667 │
└─────────┴───────────────────┘
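For comparison, here is a pandas sketch of the same G/N → G aggregation, a small preview of next week's tools, using named aggregation on the df from above:

(
    df
    .groupby('sensor', as_index=False)      # one output row per unique sensor
    .agg(avg_temp=('temperature', 'mean'))  # reduce N rows down to G
)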
Window Functions (Partitioned Aggregations)#
G/N → G/N
Window functions are designed to take a group of rows and output a result that is the same shape as its input. In this instance, let’s take the aggregated average temperature from each grouping and map it back to our original DataFrame. This is useful if we want to broadcast some aggregate operation from a group back to each individual row.
For example, we can calculate the temperature deviation: “How far away is each observed temperature reading from the average temperature for that group?”
One (subpar) way to organize this query is to perform a group by aggregation and join the aggregated table back to the original on the grouping column.
# Bad approach
duckdb.sql('''
    with agg as (
        select
            sensor,
            round(avg(temperature), 2) as temp_avg
        from df
        group by sensor
    )
    select
        df.*,
        agg.temp_avg,
        round(df.temperature - agg.temp_avg, 2) as temp_deviation,
    from df
    join agg on df.sensor = agg.sensor
    order by df.sensor
''')
┌─────────┬────────────┬─────────────┬──────────┬────────────────┐
│ sensor │ date │ temperature │ temp_avg │ temp_deviation │
│ varchar │ date │ int64 │ double │ double │
├─────────┼────────────┼─────────────┼──────────┼────────────────┤
│ a │ 2000-01-01 │ 56 │ 61.25 │ -5.25 │
│ a │ 2000-01-03 │ 59 │ 61.25 │ -2.25 │
│ a │ 2000-01-06 │ 68 │ 61.25 │ 6.75 │
│ a │ 2000-01-07 │ 62 │ 61.25 │ 0.75 │
│ b │ 2000-01-01 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-03 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-05 │ 64 │ 72.17 │ -8.17 │
│ b │ 2000-01-07 │ 69 │ 72.17 │ -3.17 │
│ b │ 2000-01-08 │ 81 │ 72.17 │ 8.83 │
│ b │ 2000-01-10 │ 67 │ 72.17 │ -5.17 │
├─────────┴────────────┴─────────────┴──────────┴────────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────┘
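For reference, the same “aggregate, then join back” pattern in pandas would look something like the sketch below, where merge plays the role of the SQL join:

agg = (
    df
    .groupby('sensor', as_index=False)
    .agg(temp_avg=('temperature', 'mean'))
    .round({'temp_avg': 2})
)
(
    df
    .merge(agg, on='sensor')  # rejoin the aggregate onto the original rows
    .assign(temp_deviation=lambda d: (d['temperature'] - d['temp_avg']).round(2))
)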
While we manage to obtain the desired result, the approach is a bit clunky: it requires
separating the aggregate operation from the rejoining. Instead, SQL provides
a more concise syntax that also lends itself to better query optimization. For window functions,
we can leave the group by clause behind, instead using an over (partition by …)
clause inside of the column selection. This makes sense because the result of
the window function will be the same size as all of the other columns, making
them directly comparable with one another.
# Good approach
duckdb.sql('''
    select
        *,
        round(
            avg(temperature) over (partition by sensor), 2
        ) as temp_avg,
        round(temperature - temp_avg, 2) as temp_deviation,
    from df
    order by sensor, date
''')
┌─────────┬────────────┬─────────────┬──────────┬────────────────┐
│ sensor │ date │ temperature │ temp_avg │ temp_deviation │
│ varchar │ date │ int64 │ double │ double │
├─────────┼────────────┼─────────────┼──────────┼────────────────┤
│ a │ 2000-01-01 │ 56 │ 61.25 │ -5.25 │
│ a │ 2000-01-03 │ 59 │ 61.25 │ -2.25 │
│ a │ 2000-01-06 │ 68 │ 61.25 │ 6.75 │
│ a │ 2000-01-07 │ 62 │ 61.25 │ 0.75 │
│ b │ 2000-01-01 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-03 │ 76 │ 72.17 │ 3.83 │
│ b │ 2000-01-05 │ 64 │ 72.17 │ -8.17 │
│ b │ 2000-01-07 │ 69 │ 72.17 │ -3.17 │
│ b │ 2000-01-08 │ 81 │ 72.17 │ 8.83 │
│ b │ 2000-01-10 │ 67 │ 72.17 │ -5.17 │
├─────────┴────────────┴─────────────┴──────────┴────────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────┘
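The window-function equivalent in pandas is groupby(…).transform, which broadcasts a group-level aggregate back to every row. Here is a sketch, again previewing next week's tools:

(
    df
    .assign(
        # broadcast the per-sensor mean back onto every row
        temp_avg=lambda d: d.groupby('sensor')['temperature'].transform('mean').round(2),
        temp_deviation=lambda d: (d['temperature'] - d['temp_avg']).round(2),
    )
)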
Window Functions (Running)#
While this is not a formal distinction, I previously mentioned that we can supply an aggregation to our window operation. However, since we have access to all of the values within a partition (grouping), we are not limited to simple aggregate functions.
By specifying an order by clause within … over (partition by … order by …),
we can alter the window of rows that SQL examines before calculating our specified
aggregation. In this case, we can perform cumulative (or running) aggregations.
If we want to query the running maximum temperature encountered within each group,
we can use the following query:
duckdb.sql('''
    select
        *,
        max(temperature) over (partition by sensor)
            as temp_sensor_max,
        max(temperature) over (partition by sensor order by date)
            as temp_sensor_max_running,
    from df
    order by sensor, date
''')
┌─────────┬────────────┬─────────────┬─────────────────┬─────────────────────────┐
│ sensor │ date │ temperature │ temp_sensor_max │ temp_sensor_max_running │
│ varchar │ date │ int64 │ int64 │ int64 │
├─────────┼────────────┼─────────────┼─────────────────┼─────────────────────────┤
│ a │ 2000-01-01 │ 56 │ 68 │ 56 │
│ a │ 2000-01-03 │ 59 │ 68 │ 59 │
│ a │ 2000-01-06 │ 68 │ 68 │ 68 │
│ a │ 2000-01-07 │ 62 │ 68 │ 68 │
│ b │ 2000-01-01 │ 76 │ 81 │ 76 │
│ b │ 2000-01-03 │ 76 │ 81 │ 76 │
│ b │ 2000-01-05 │ 64 │ 81 │ 76 │
│ b │ 2000-01-07 │ 69 │ 81 │ 76 │
│ b │ 2000-01-08 │ 81 │ 81 │ 81 │
│ b │ 2000-01-10 │ 67 │ 81 │ 81 │
├─────────┴────────────┴─────────────┴─────────────────┴─────────────────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────────────────────┘
Note that the passed aggregation for both selections was max(temperature). The
difference here is the presence of the order by within our over clause. With an
order by, the window function is evaluated over an expanding frame: first the
first row of a partition, then the first two rows, then the first three rows,
and so on, until we have consumed all rows in the partition. (This is SQL’s
default frame of range between unbounded preceding and current row.)
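To make that frame explicit, we could spell it out ourselves. Here is a sketch; the rows variant matches the default range frame in this case because each sensor’s dates are unique:

duckdb.sql('''
    select
        *,
        max(temperature) over (
            partition by sensor
            order by date
            rows between unbounded preceding and current row
        ) as temp_sensor_max_running
    from df
    order by sensor, date
''')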
You might recognize this pattern as a “cumulative maximum,” and you can
think of all specifications of over (… order by …) as cumulative operations.
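In pandas, this cumulative flavor has a direct analogue. A sketch, assuming df is already sorted by ['sensor', 'date'] as above:

(
    df
    .assign(
        # overall max per sensor, broadcast to every row
        temp_sensor_max=lambda d: d.groupby('sensor')['temperature'].transform('max'),
        # expanding (running) max per sensor, row by row
        temp_sensor_max_running=lambda d: d.groupby('sensor')['temperature'].cummax(),
    )
)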
Wrap-Up#
That’s all the time that we have today! Next week, we’ll dive into how we express these same concepts using our Python DataFrame tools: pandas & Polars. In the meantime, make sure to join our Discord to talk more about window operations!