Tabular Group By Sets
Hi all, welcome back to Cameron's Corner! This week, I want to replicate some convenient analytical functionality from DuckDB in both pandas and Polars.
Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
Back to the topic at hand!
Today, I set out to replicate DuckDB's "Grouping Sets" functionality in Polars and pandas: an analytical feature that lets us aggregate across pre-defined, nested, or permuted sets of grouping columns.
While this convenient functionality exists in DuckDB, it's actually not built into either pandas or Polars (there's even an open issue in Polars requesting this functionality). I want to discuss why understanding and using Python as a language is important, even when the bulk of our work is analytical.
Data
Let’s start by modeling some data that have a few categorical (grouping) columns for us to compare across.
from pandas import DataFrame, Timestamp, to_timedelta
from numpy.random import default_rng
from polars import from_pandas
rng = default_rng(0)
center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]
pd_df = DataFrame({
"center_id": rng.choice(center_ids, size=(size := 100)),
"location": rng.choice(locations, size=size),
"service_type": rng.choice(service_types, size=size),
"timestamp": Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='D'),
"cpu_usage": rng.uniform(0, 100, size=size),
"mem_usage": rng.uniform(0, 64, size=size),
})
pl_df = from_pandas(pd_df)
pd_df.head()
| | center_id | location | service_type | timestamp | cpu_usage | mem_usage |
---|---|---|---|---|---|---|
0 | Center_1 | East | Web Hosting | 2023-11-22 15:50:03.116313 | 31.968164 | 57.395279 |
1 | Center_1 | West | Data Processing | 2024-12-03 15:50:03.116313 | 18.750772 | 37.332483 |
2 | Center_1 | East | Web Hosting | 2024-01-01 15:50:03.116313 | 67.252663 | 2.573966 |
3 | Center_0 | East | Web Hosting | 2021-05-17 15:50:03.116313 | 19.510740 | 45.535157 |
4 | Center_0 | West | Cloud Storage | 2023-12-05 15:50:03.116313 | 57.768789 | 36.417655 |
Grouping Sets
In this first exercise, we want to understand the average 'cpu_usage' and 'mem_usage' for the following cross sections:
within each combination of 'center_id', 'location', 'service_type'
within 'service_type', ignoring 'center_id' and 'location'
across all centers, regardless of any other categories.
DuckDB - Grouping Sets
DuckDB provides an incredibly convenient syntax, group by grouping sets, which lets us specify each grouping set directly in the query.
import duckdb
duckdb.sql('''
select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
from pl_df
group by grouping sets (
(center_id, location, service_type),
(service_type),
()
)
order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │ service_type │ avg(cpu_usage) │ avg(mem_usage) │
│ varchar │ varchar │ varchar │ double │ double │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0 │ East │ Cloud Storage │ 48.573814147140006 │ 34.59012485147718 │
│ Center_0 │ East │ Data Processing │ 38.89813718396443 │ 31.946375637497113 │
│ Center_0 │ East │ Web Hosting │ 49.510621712697535 │ 31.49215279790748 │
│ Center_0 │ West │ Cloud Storage │ 59.67797041313039 │ 39.92657027278433 │
│ Center_0 │ West │ Data Processing │ 54.954340526705 │ 21.972751748733323 │
│ Center_0 │ West │ Web Hosting │ 65.74395477203684 │ 40.34816417380653 │
│ Center_1 │ East │ Cloud Storage │ 58.06807588431277 │ 26.131857675662502 │
│ Center_1 │ East │ Data Processing │ 52.559742275163295 │ 35.97783969816975 │
│ Center_1 │ East │ Web Hosting │ 53.60845973063669 │ 27.885182582416054 │
│ Center_1 │ West │ Cloud Storage │ 67.40609965208978 │ 32.27891608705559 │
│ Center_1 │ West │ Data Processing │ 40.08796317401972 │ 35.195382508091186 │
│ Center_1 │ West │ Web Hosting │ 49.16242082013778 │ 35.98280493510762 │
│ NULL │ NULL │ Cloud Storage │ 58.60578093193317 │ 32.23054243368773 │
│ NULL │ NULL │ Data Processing │ 47.75415518833317 │ 29.978704128286843 │
│ NULL │ NULL │ Web Hosting │ 54.8832619326503 │ 33.790532776407105 │
│ NULL │ NULL │ NULL │ 54.155927816540164 │ 32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 16 rows 5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘
Polars - Grouping Sets
In contrast, Polars has no built-in way to do this. We’ll need to build our query piece by piece and concatenate the results, and we can do so with a polars.LazyFrame so that the query engine can make as many optimizations as possible.
This code implements the same approach that DuckDB takes: computing each aggregation separately and vertically concatenating them into a single result set.
from polars import col, concat as pl_concat
aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
groupings = [
['center_id', 'location', 'service_type'],
['service_type'],
[],
]
frames = []
ldf = pl_df.lazy()
for gs in groupings:
if not gs:
query = ldf.select(aggfuncs)
else:
query = ldf.group_by(gs).agg(aggfuncs)
frames.append(query)
pl_concat(frames, how='diagonal').collect()
center_id | location | service_type | cpu_usage | mem_usage |
---|---|---|---|---|
str | str | str | f64 | f64 |
"Center_1" | "West" | "Data Processing" | 40.087963 | 35.195383 |
"Center_1" | "East" | "Data Processing" | 52.559742 | 35.97784 |
"Center_0" | "West" | "Cloud Storage" | 59.67797 | 39.92657 |
"Center_0" | "West" | "Data Processing" | 54.954341 | 21.972752 |
"Center_0" | "East" | "Cloud Storage" | 48.573814 | 34.590125 |
… | … | … | … | … |
"Center_1" | "East" | "Web Hosting" | 53.60846 | 27.885183 |
null | null | "Web Hosting" | 54.883262 | 33.790533 |
null | null | "Cloud Storage" | 58.605781 | 32.230542 |
null | null | "Data Processing" | 47.754155 | 29.978704 |
null | null | null | 54.155928 | 32.123506 |
pandas - Grouping Sets
Finally, we can reproduce this same analysis in pandas. I find both the approach and the syntax to be quite similar to those of Polars. The only downside is that we need to eagerly compute each of these aggregations before concatenating them, whereas Polars can evaluate its lazy queries in parallel.
from pandas import concat as pd_concat
results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
groupings = [
['center_id', 'location', 'service_type'],
['service_type'],
[],
]
for gs in groupings:
if not gs:
res = pd_df.agg(aggfuncs).to_frame().T
else:
res = pd_df.groupby(gs, as_index=False).agg(aggfuncs)
results.append(res)
pd_concat(results, ignore_index=True)
| | center_id | location | service_type | cpu_usage | mem_usage |
---|---|---|---|---|---|
0 | Center_0 | East | Cloud Storage | 48.573814 | 34.590125 |
1 | Center_0 | East | Data Processing | 38.898137 | 31.946376 |
2 | Center_0 | East | Web Hosting | 49.510622 | 31.492153 |
3 | Center_0 | West | Cloud Storage | 59.677970 | 39.926570 |
4 | Center_0 | West | Data Processing | 54.954341 | 21.972752 |
5 | Center_0 | West | Web Hosting | 65.743955 | 40.348164 |
6 | Center_1 | East | Cloud Storage | 58.068076 | 26.131858 |
7 | Center_1 | East | Data Processing | 52.559742 | 35.977840 |
8 | Center_1 | East | Web Hosting | 53.608460 | 27.885183 |
9 | Center_1 | West | Cloud Storage | 67.406100 | 32.278916 |
10 | Center_1 | West | Data Processing | 40.087963 | 35.195383 |
11 | Center_1 | West | Web Hosting | 49.162421 | 35.982805 |
12 | NaN | NaN | Cloud Storage | 58.605781 | 32.230542 |
13 | NaN | NaN | Data Processing | 47.754155 | 29.978704 |
14 | NaN | NaN | Web Hosting | 54.883262 | 33.790533 |
15 | NaN | NaN | NaN | 54.155928 | 32.123506 |
Roll Up
A roll up is a convenient way to express hierarchical grouping sets. In this case, instead of manually specifying our sets, we can generate them programmatically. The idea is that if we want aggregations at every level of a hierarchy, we repeatedly drop the rightmost (most specific) column, like so:
rollup('center_id', 'location', 'service_type')
# produces the following groupings
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id']
[]
DuckDB - Roll Up
With DuckDB, we can use the built-in rollup syntax to express this idea:
import duckdb
duckdb.sql('''
select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
from pl_df
group by rollup (center_id, location, service_type)
order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │ service_type │ avg(cpu_usage) │ avg(mem_usage) │
│ varchar │ varchar │ varchar │ double │ double │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0 │ East │ Cloud Storage │ 48.573814147140006 │ 34.59012485147718 │
│ Center_0 │ East │ Data Processing │ 38.89813718396443 │ 31.946375637497113 │
│ Center_0 │ East │ Web Hosting │ 49.510621712697535 │ 31.49215279790748 │
│ Center_0 │ East │ NULL │ 47.19697970618814 │ 33.030178971279625 │
│ Center_0 │ West │ Cloud Storage │ 59.67797041313039 │ 39.92657027278433 │
│ Center_0 │ West │ Data Processing │ 54.954340526705 │ 21.972751748733323 │
│ Center_0 │ West │ Web Hosting │ 65.74395477203684 │ 40.34816417380653 │
│ Center_0 │ West │ NULL │ 59.77552339385185 │ 32.75258291517836 │
│ Center_0 │ NULL │ NULL │ 54.915631514527234 │ 32.85983593685385 │
│ Center_1 │ East │ Cloud Storage │ 58.06807588431277 │ 26.131857675662502 │
│ Center_1 │ East │ Data Processing │ 52.559742275163295 │ 35.97783969816975 │
│ Center_1 │ East │ Web Hosting │ 53.60845973063669 │ 27.885182582416054 │
│ Center_1 │ East │ NULL │ 55.3324528223062 │ 28.57887496447768 │
│ Center_1 │ West │ Cloud Storage │ 67.40609965208978 │ 32.27891608705559 │
│ Center_1 │ West │ Data Processing │ 40.08796317401972 │ 35.195382508091186 │
│ Center_1 │ West │ Web Hosting │ 49.16242082013778 │ 35.98280493510762 │
│ Center_1 │ West │ NULL │ 51.78558271393731 │ 34.511046938585025 │
│ Center_1 │ NULL │ NULL │ 53.55901776812175 │ 31.544960951531376 │
│ NULL │ NULL │ NULL │ 54.155927816540164 │ 32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 19 rows 5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘
But now let's move on to our implementations in tools that do not have this functionality built in. We'll first need to generate these groupings at the Python level so that both Polars and pandas can operate along these groups.
Roll Up Formulation
I always say that good usage of third-party Python packages (Polars, pandas, Matplotlib, etc.) requires a good understanding and application of Python itself. In this case, we need Python to create our rollup groups, and thankfully, itertools makes this quite straightforward.
from itertools import islice
def rollup(*items):
    "produce shrinking prefixes of items, from longest to shortest"
    return (
        [*islice(items, i)] for i in range(len(items), -1, -1)
    )
for gs in rollup('center_id', 'location', 'service_type'):
print(gs)
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id']
[]
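If the islice spelling reads as opaque, the same generator can be written with plain slicing. This is just an equivalent sketch; rollup_slices is an illustrative name, not something we use below:
def rollup_slices(*items):
    "all prefixes of items, from longest to shortest (same output as rollup)"
    return (list(items[:i]) for i in range(len(items), -1, -1))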
Now that we have a generator to create these groups, let's apply it to our DataFrames to see it in action. The approach is nearly identical to our grouping sets implementation.
Polars - Roll Up
from polars import col, concat as pl_concat, lit
aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
frames = []
ldf = pl_df.lazy()
for gs in rollup('center_id', 'location', 'service_type'):
if not gs:
query = ldf.select(aggfuncs)
else:
query = ldf.group_by(gs).agg(aggfuncs)
frames.append(
query.with_columns(groupings=lit(gs))
)
pl_concat(frames, how='diagonal_relaxed').collect()
center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|
str | str | str | f64 | f64 | list[str] |
"Center_0" | "West" | "Cloud Storage" | 59.67797 | 39.92657 | ["center_id", "location", "service_type"] |
"Center_0" | "West" | "Data Processing" | 54.954341 | 21.972752 | ["center_id", "location", "service_type"] |
"Center_1" | "East" | "Cloud Storage" | 58.068076 | 26.131858 | ["center_id", "location", "service_type"] |
"Center_1" | "East" | "Web Hosting" | 53.60846 | 27.885183 | ["center_id", "location", "service_type"] |
"Center_1" | "West" | "Data Processing" | 40.087963 | 35.195383 | ["center_id", "location", "service_type"] |
… | … | … | … | … | … |
null | "East" | "Cloud Storage" | 54.270371 | 29.515165 | ["location", "service_type"] |
null | null | "Web Hosting" | 54.883262 | 33.790533 | ["service_type"] |
null | null | "Cloud Storage" | 58.605781 | 32.230542 | ["service_type"] |
null | null | "Data Processing" | 47.754155 | 29.978704 | ["service_type"] |
null | null | null | 54.155928 | 32.123506 | [] |
pandas - Roll Up
from pandas import concat, Series as pd_Series
results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
for gs in rollup('center_id', 'location', 'service_type'):
if not gs:
res = pd_df.agg(aggfuncs).to_frame().T
else:
res = pd_df.groupby(list(gs), as_index=False).agg(aggfuncs)
results.append(
res.assign(
groupings=lambda d: pd_Series([gs] * len(d), index=d.index)
)
)
concat(results, ignore_index=True)
| | center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|---|
0 | Center_0 | East | Cloud Storage | 48.573814 | 34.590125 | [center_id, location, service_type] |
1 | Center_0 | East | Data Processing | 38.898137 | 31.946376 | [center_id, location, service_type] |
2 | Center_0 | East | Web Hosting | 49.510622 | 31.492153 | [center_id, location, service_type] |
3 | Center_0 | West | Cloud Storage | 59.677970 | 39.926570 | [center_id, location, service_type] |
4 | Center_0 | West | Data Processing | 54.954341 | 21.972752 | [center_id, location, service_type] |
5 | Center_0 | West | Web Hosting | 65.743955 | 40.348164 | [center_id, location, service_type] |
6 | Center_1 | East | Cloud Storage | 58.068076 | 26.131858 | [center_id, location, service_type] |
7 | Center_1 | East | Data Processing | 52.559742 | 35.977840 | [center_id, location, service_type] |
8 | Center_1 | East | Web Hosting | 53.608460 | 27.885183 | [center_id, location, service_type] |
9 | Center_1 | West | Cloud Storage | 67.406100 | 32.278916 | [center_id, location, service_type] |
10 | Center_1 | West | Data Processing | 40.087963 | 35.195383 | [center_id, location, service_type] |
11 | Center_1 | West | Web Hosting | 49.162421 | 35.982805 | [center_id, location, service_type] |
12 | Center_0 | East | NaN | 47.196980 | 33.030179 | [center_id, location]
13 | Center_0 | West | NaN | 59.775523 | 32.752583 | [center_id, location]
14 | Center_1 | East | NaN | 55.332453 | 28.578875 | [center_id, location]
15 | Center_1 | West | NaN | 51.785583 | 34.511047 | [center_id, location]
16 | Center_0 | NaN | NaN | 54.915632 | 32.859836 | [center_id]
17 | Center_1 | NaN | NaN | 53.559018 | 31.544961 | [center_id]
18 | NaN | NaN | NaN | 54.155928 | 32.123506 | []
Cube
For our final recreation, we are going to implement the "group by cube" syntax that DuckDB offers. Let's take a look at the result set.
DuckDB - Cube
import duckdb
duckdb.sql('''
select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
from pl_df
group by cube (center_id, location, service_type)
order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │ service_type │ avg(cpu_usage) │ avg(mem_usage) │
│ varchar │ varchar │ varchar │ double │ double │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0 │ East │ Cloud Storage │ 48.573814147140006 │ 34.59012485147718 │
│ Center_0 │ East │ Data Processing │ 38.89813718396443 │ 31.946375637497113 │
│ Center_0 │ East │ Web Hosting │ 49.510621712697535 │ 31.49215279790748 │
│ Center_0 │ East │ NULL │ 47.19697970618814 │ 33.030178971279625 │
│ Center_0 │ West │ Cloud Storage │ 59.67797041313039 │ 39.92657027278433 │
│ Center_0 │ West │ Data Processing │ 54.954340526705 │ 21.972751748733323 │
│ Center_0 │ West │ Web Hosting │ 65.74395477203684 │ 40.34816417380653 │
│ Center_0 │ West │ NULL │ 59.77552339385185 │ 32.75258291517836 │
│ Center_0 │ NULL │ Cloud Storage │ 53.755753737935514 │ 37.08046604808719 │
│ Center_0 │ NULL │ Data Processing │ 51.513725524689164 │ 24.109956867754132 │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ NULL │ East │ Web Hosting │ 52.16216395959934 │ 29.158230893765968 │
│ NULL │ East │ NULL │ 52.259051867328274 │ 30.26047870038065 │
│ NULL │ West │ Cloud Storage │ 64.02504311004503 │ 35.624764793311925 │
│ NULL │ West │ Data Processing │ 47.87511321590248 │ 28.26924258652279 │
│ NULL │ West │ Web Hosting │ 57.45318779608731 │ 38.16548455445707 │
│ NULL │ West │ NULL │ 55.70791722953172 │ 33.64780096345814 │
│ NULL │ NULL │ Cloud Storage │ 58.60578093193317 │ 32.23054243368773 │
│ NULL │ NULL │ Data Processing │ 47.75415518833317 │ 29.978704128286843 │
│ NULL │ NULL │ Web Hosting │ 54.8832619326503 │ 33.790532776407105 │
│ NULL │ NULL │ NULL │ 54.155927816540164 │ 32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 36 rows (20 shown) 5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘
Cube Formulation
In terms of combinatorics, "cube" is effectively a reversed powerset of the input elements, where a powerset is every possible combination, of every size, drawn from a given set of inputs. The following pseudocode snippet shows the expected inputs and outputs for this pattern.
cube('center_id', 'location', 'service_type')
# produces the following groupings
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id', 'service_type']
['location', 'service_type']
['center_id']
['location']
['service_type']
[]
The above pattern is incredibly similar to the powerset recipe in itertools with the caveat that cube is reversed so that it begins with the largest group instead of the smallest.
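For reference, the powerset recipe from the itertools documentation builds the combinations from smallest to largest:
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) → () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))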
Our rewrite in Python would look like:
from itertools import combinations
def cube(*items):
    """reversed version of the itertools powerset recipe"""
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]
for gs in cube('center_id', 'location', 'service_type'):
print(gs)
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id', 'service_type']
['location', 'service_type']
['center_id']
['location']
['service_type']
[]
Polars - Cube
from polars import col, concat as pl_concat, lit
aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
frames = []
ldf = pl_df.lazy()
for gs in cube('center_id', 'location', 'service_type'):
if not gs:
query = ldf.select(aggfuncs)
else:
query = ldf.group_by(gs).agg(aggfuncs)
frames.append(
query.with_columns(groupings=lit(gs))
)
pl_concat(frames, how='diagonal_relaxed').collect()
center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|
str | str | str | f64 | f64 | list[str] |
"Center_0" | "West" | "Data Processing" | 54.954341 | 21.972752 | ["center_id", "location", "service_type"] |
"Center_0" | "West" | "Cloud Storage" | 59.67797 | 39.92657 | ["center_id", "location", "service_type"] |
"Center_0" | "East" | "Web Hosting" | 49.510622 | 31.492153 | ["center_id", "location", "service_type"] |
"Center_1" | "East" | "Data Processing" | 52.559742 | 35.97784 | ["center_id", "location", "service_type"] |
"Center_1" | "West" | "Data Processing" | 40.087963 | 35.195383 | ["center_id", "location", "service_type"] |
… | … | … | … | … | … |
null | "East" | null | 52.259052 | 30.260479 | ["location"] |
null | null | "Cloud Storage" | 58.605781 | 32.230542 | ["service_type"] |
null | null | "Web Hosting" | 54.883262 | 33.790533 | ["service_type"] |
null | null | "Data Processing" | 47.754155 | 29.978704 | ["service_type"] |
null | null | null | 54.155928 | 32.123506 | [] |
pandas - Cube
from pandas import concat, Series as pd_Series
results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
for gs in cube('center_id', 'location', 'service_type'):
if not gs:
res = pd_df.agg(aggfuncs).to_frame().T
else:
res = pd_df.groupby(list(gs), as_index=False).agg(aggfuncs)
results.append(
res.assign(
groupings=lambda d: pd_Series([gs] * len(d), index=d.index)
)
)
concat(results, ignore_index=True).fillna('') # fillna to view output easily
| | center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|---|
0 | Center_0 | East | Cloud Storage | 48.573814 | 34.590125 | [center_id, location, service_type] |
1 | Center_0 | East | Data Processing | 38.898137 | 31.946376 | [center_id, location, service_type] |
2 | Center_0 | East | Web Hosting | 49.510622 | 31.492153 | [center_id, location, service_type] |
3 | Center_0 | West | Cloud Storage | 59.677970 | 39.926570 | [center_id, location, service_type] |
4 | Center_0 | West | Data Processing | 54.954341 | 21.972752 | [center_id, location, service_type] |
5 | Center_0 | West | Web Hosting | 65.743955 | 40.348164 | [center_id, location, service_type] |
6 | Center_1 | East | Cloud Storage | 58.068076 | 26.131858 | [center_id, location, service_type] |
7 | Center_1 | East | Data Processing | 52.559742 | 35.977840 | [center_id, location, service_type] |
8 | Center_1 | East | Web Hosting | 53.608460 | 27.885183 | [center_id, location, service_type] |
9 | Center_1 | West | Cloud Storage | 67.406100 | 32.278916 | [center_id, location, service_type] |
10 | Center_1 | West | Data Processing | 40.087963 | 35.195383 | [center_id, location, service_type] |
11 | Center_1 | West | Web Hosting | 49.162421 | 35.982805 | [center_id, location, service_type] |
12 | Center_0 | East |  | 47.196980 | 33.030179 | [center_id, location]
13 | Center_0 | West |  | 59.775523 | 32.752583 | [center_id, location]
14 | Center_1 | East |  | 55.332453 | 28.578875 | [center_id, location]
15 | Center_1 | West |  | 51.785583 | 34.511047 | [center_id, location]
16 | Center_0 |  | Cloud Storage | 53.755754 | 37.080466 | [center_id, service_type]
17 | Center_0 |  | Data Processing | 51.513726 | 24.109957 | [center_id, service_type]
18 | Center_0 |  | Web Hosting | 59.250622 | 36.805760 | [center_id, service_type]
19 | Center_1 |  | Cloud Storage | 62.070086 | 28.766311 | [center_id, service_type]
20 | Center_1 |  | Data Processing | 44.245223 | 35.456202 | [center_id, service_type]
21 | Center_1 |  | Web Hosting | 51.607742 | 31.529113 | [center_id, service_type]
22 |  | East | Cloud Storage | 54.270371 | 29.515165 | [location, service_type]
23 |  | East | Data Processing | 47.436640 | 34.466041 | [location, service_type]
24 |  | East | Web Hosting | 52.162164 | 29.158231 | [location, service_type]
25 |  | West | Cloud Storage | 64.025043 | 35.624765 | [location, service_type]
26 |  | West | Data Processing | 47.875113 | 28.269243 | [location, service_type]
27 |  | West | Web Hosting | 57.453188 | 38.165485 | [location, service_type]
28 | Center_0 |  |  | 54.915632 | 32.859836 | [center_id]
29 | Center_1 |  |  | 53.559018 | 31.544961 | [center_id]
30 |  | East |  | 52.259052 | 30.260479 | [location]
31 |  | West |  | 55.707917 | 33.647801 | [location]
32 |  |  | Cloud Storage | 58.605781 | 32.230542 | [service_type]
33 |  |  | Data Processing | 47.754155 | 29.978704 | [service_type]
34 |  |  | Web Hosting | 54.883262 | 33.790533 | [service_type]
35 |  |  |  | 54.155928 | 32.123506 | []
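Before we wrap up, it's worth noting that all three patterns share the same shape: generate the groupings, aggregate once per grouping, then concatenate. If you find yourself doing this often, the Polars version folds neatly into a small helper. This is just a sketch built from the code above; grouped_agg is a name I'm introducing here, not a Polars API:
from polars import concat as pl_concat, lit
def grouped_agg(ldf, groupings, aggfuncs):
    "aggregate ldf once per grouping set and stack the results"
    frames = []
    for gs in groupings:
        gs = list(gs)
        query = ldf.select(aggfuncs) if not gs else ldf.group_by(gs).agg(aggfuncs)
        frames.append(query.with_columns(groupings=lit(gs)))
    return pl_concat(frames, how='diagonal_relaxed')

# e.g. grouped_agg(pl_df.lazy(), cube('center_id', 'location', 'service_type'), aggfuncs).collect()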
Wrap-Up
There you have it: replicating DuckDB's Group By "Grouping Sets", "Rollup", and "Cube" functionality! While both pandas and Polars require a bit more manual work, it's entirely feasible with the right approach. A bit of pure Python understanding takes us a long way, even for analytical tasks where we lean heavily on third-party packages.
What did you think about these approaches? Let me know on the DUTC Discord. Talk to you all next week!