Tables: Window Functions vs Group By#

Theoretical Comparison#

Both “window” and “group by” operations are used to perform operations across a subset of rows of a table. These rows are subsetted based on a unique grouping of values within a column.

For example the following table has 3 columns 'sensor', 'date', and 'temperature' to represent different temperatures collected by different sensors (presumably in different locations) across different days.

import pandas as pd
import pyarrow as pa
from numpy.random import default_rng
rng = default_rng(8)

df = (
    pd.MultiIndex.from_product(
        iterables=[['a', 'b'], pd.date_range('2000-01-01', periods=10, tz='US/Pacific')],
        names=['sensor', 'date']
    )
    .to_frame(index=False)
    .assign(
        temperature=lambda d: rng.normal(70, scale=8, size=len(d)).astype(int)
    )
    .sample(n=10, random_state=rng)
    .sort_values(by=['sensor', 'date'])
    .reset_index(drop=True)
    .astype({
        'date': pd.ArrowDtype(pa.date64()),
    })
)

# df = pl.from_pandas(df)

df

Examining the table above, we can conclude that

  • The total DataFrame has N rows (10 rows)

  • There exist G groups of unique sensors (2 unique sensors)

N×G represents the DataFrame in its entirety denoting N rows with G unique groups.

I’ll be using this N & G syntax to explain how these transformations mediates the shape of the resultant table.

For these conceptual sections, I’ll be using SQL via duckdb to showcase the operations before diving in both pandas and polars.