Tabular Group By Sets#

Hi all, welcome back to Cameron’s Corner! This week, I want to replicate some convenient analytical functionality from DuckDB in both pandas and Polars.

Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

Back to the topic at hand!

Today, I set out to replicate DuckDB’s “Grouping Sets” functionality in Polars and pandas. Grouping sets are an analytical feature that lets us aggregate over pre-defined, nested, or permuted sets of grouping columns within a single query.

While this convenient functionality exists in DuckDB, it’s actually not built into either pandas or Polars (there’s even an open issue in Polars requesting this functionality). I want to discuss why understanding and using Python as a language is important, even when the bulk of our work is analytical.

Data#

Let’s start by modeling some data that have a few categorical (grouping) columns for us to compare across.

from pandas import DataFrame, Timestamp, to_timedelta
from numpy.random import default_rng
from polars import from_pandas

rng = default_rng(0)

center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]

pd_df = DataFrame({
    "center_id":    rng.choice(center_ids, size=(size := 100)),
    "location":     rng.choice(locations, size=size),
    "service_type": rng.choice(service_types, size=size),
    "timestamp":    Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='D'),
    "cpu_usage":    rng.uniform(0, 100, size=size),
    "mem_usage":    rng.uniform(0, 64, size=size),
})

pl_df = from_pandas(pd_df)
pd_df.head()
center_id location service_type timestamp cpu_usage mem_usage
0 Center_1 East Web Hosting 2023-07-19 07:05:26.116328 31.968164 57.395279
1 Center_1 West Data Processing 2024-07-30 07:05:26.116328 18.750772 37.332483
2 Center_1 East Web Hosting 2023-08-28 07:05:26.116328 67.252663 2.573966
3 Center_0 East Web Hosting 2021-01-11 07:05:26.116328 19.510740 45.535157
4 Center_0 West Cloud Storage 2023-08-01 07:05:26.116328 57.768789 36.417655

Grouping Sets#

In this first exercise, we want to understand the average 'cpu_usage' and 'mem_usage' for the following cross sections:

  1. within each combination of 'center_id', 'location', 'service_type'

  2. within 'service_type', ignoring 'center_id' and 'location'

  3. across all centers, regardless of any other categories.
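
Expressed as plain Python lists of grouping columns (the same groupings list we will hand to Polars and pandas below), those three cross sections look like this:

groupings = [
    ['center_id', 'location', 'service_type'],  # 1. every combination of the three columns
    ['service_type'],                           # 2. service_type alone
    [],                                         # 3. the grand total across all rows
]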

DuckDB - Grouping Sets#

DuckDB provides an incredibly convenient syntax, group by grouping sets, where we simply list the sets of columns we want to group by.

import duckdb


duckdb.sql('''
    select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
    from pl_df
    group by grouping sets (
        (center_id, location, service_type),
        (service_type),
        ()
    )
    order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │  service_type   │   avg(cpu_usage)   │   avg(mem_usage)   │
│  varchar  │ varchar  │     varchar     │       double       │       double       │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0  │ East     │ Cloud Storage   │ 48.573814147140006 │  34.59012485147718 │
│ Center_0  │ East     │ Data Processing │  38.89813718396443 │ 31.946375637497113 │
│ Center_0  │ East     │ Web Hosting     │ 49.510621712697535 │  31.49215279790748 │
│ Center_0  │ West     │ Cloud Storage   │  59.67797041313039 │  39.92657027278433 │
│ Center_0  │ West     │ Data Processing │    54.954340526705 │ 21.972751748733323 │
│ Center_0  │ West     │ Web Hosting     │  65.74395477203684 │  40.34816417380653 │
│ Center_1  │ East     │ Cloud Storage   │  58.06807588431277 │ 26.131857675662502 │
│ Center_1  │ East     │ Data Processing │ 52.559742275163295 │  35.97783969816975 │
│ Center_1  │ East     │ Web Hosting     │  53.60845973063669 │ 27.885182582416054 │
│ Center_1  │ West     │ Cloud Storage   │  67.40609965208978 │  32.27891608705559 │
│ Center_1  │ West     │ Data Processing │  40.08796317401972 │ 35.195382508091186 │
│ Center_1  │ West     │ Web Hosting     │  49.16242082013778 │  35.98280493510762 │
│ NULL      │ NULL     │ Cloud Storage   │  58.60578093193317 │  32.23054243368773 │
│ NULL      │ NULL     │ Data Processing │  47.75415518833317 │ 29.978704128286843 │
│ NULL      │ NULL     │ Web Hosting     │   54.8832619326503 │ 33.790532776407105 │
│ NULL      │ NULL     │ NULL            │ 54.155927816540164 │  32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 16 rows                                                                5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘

Polars - Grouping Sets#

In contrast, Polars has no built-in way to do this. We’ll need to build our query in parts and concatenate the results. We can do this with a polars.LazyFrame so that the query engine can make as many optimizations as possible.

This code implements the same idea that grouping sets expresses: collect a separate aggregation for each set of grouping columns and vertically concatenate them into a single result.

from polars import col, concat as pl_concat

aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
groupings = [
    ['center_id', 'location', 'service_type'],
    ['service_type'],
    [],
]

frames = []
ldf = pl_df.lazy()
for gs in groupings:
    if not gs:
        query = ldf.select(aggfuncs)
    else:
        query = ldf.group_by(gs).agg(aggfuncs)
    frames.append(query)
    
pl_concat(frames, how='diagonal').collect()
shape: (16, 5)
┌────────────┬──────────┬───────────────────┬───────────┬───────────┐
│ center_id  ┆ location ┆ service_type      ┆ cpu_usage ┆ mem_usage │
│ ---        ┆ ---      ┆ ---               ┆ ---       ┆ ---       │
│ str        ┆ str      ┆ str               ┆ f64       ┆ f64       │
╞════════════╪══════════╪═══════════════════╪═══════════╪═══════════╡
│ "Center_0" ┆ "East"   ┆ "Data Processing" ┆ 38.898137 ┆ 31.946376 │
│ "Center_1" ┆ "West"   ┆ "Data Processing" ┆ 40.087963 ┆ 35.195383 │
│ "Center_0" ┆ "West"   ┆ "Cloud Storage"   ┆ 59.67797  ┆ 39.92657  │
│ "Center_1" ┆ "East"   ┆ "Data Processing" ┆ 52.559742 ┆ 35.97784  │
│ "Center_0" ┆ "East"   ┆ "Web Hosting"     ┆ 49.510622 ┆ 31.492153 │
│ …          ┆ …        ┆ …                 ┆ …         ┆ …         │
│ "Center_1" ┆ "West"   ┆ "Cloud Storage"   ┆ 67.4061   ┆ 32.278916 │
│ null       ┆ null     ┆ "Cloud Storage"   ┆ 58.605781 ┆ 32.230542 │
│ null       ┆ null     ┆ "Web Hosting"     ┆ 54.883262 ┆ 33.790533 │
│ null       ┆ null     ┆ "Data Processing" ┆ 47.754155 ┆ 29.978704 │
│ null       ┆ null     ┆ null              ┆ 54.155928 ┆ 32.123506 │
└────────────┴──────────┴───────────────────┴───────────┴───────────┘
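
Since the loop above only depends on the grouping sets and the aggregations, it wraps neatly into a small helper. Here is a minimal sketch of one way to package the pattern; group_by_grouping_sets is my own name, not a Polars API:

from polars import LazyFrame, concat as pl_concat

def group_by_grouping_sets(ldf: LazyFrame, grouping_sets, aggfuncs) -> LazyFrame:
    """emulate SQL's GROUP BY GROUPING SETS by concatenating one lazy aggregation per set"""
    frames = [
        ldf.select(aggfuncs) if not gs else ldf.group_by(gs).agg(aggfuncs)
        for gs in grouping_sets
    ]
    return pl_concat(frames, how='diagonal')

group_by_grouping_sets(pl_df.lazy(), groupings, aggfuncs).collect()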

pandas - Grouping Sets#

Finally, we can reproduce this same analysis in pandas. I find both the approach and the syntax to be quite similar to that of Polars. The only downside is that we will need to eagerly compute each of these aggregations before concatenating them, a task that Polars handles in parallel.

from pandas import concat as pd_concat

results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
groupings = [
    ['center_id', 'location', 'service_type'],
    ['service_type'],
    [],
]

for gs in groupings:
    if not gs:
        res = pd_df.agg(aggfuncs).to_frame().T
    else:
        res = pd_df.groupby(gs, as_index=False).agg(aggfuncs)
    results.append(res)
    
pd_concat(results, ignore_index=True)
center_id location service_type cpu_usage mem_usage
0 Center_0 East Cloud Storage 48.573814 34.590125
1 Center_0 East Data Processing 38.898137 31.946376
2 Center_0 East Web Hosting 49.510622 31.492153
3 Center_0 West Cloud Storage 59.677970 39.926570
4 Center_0 West Data Processing 54.954341 21.972752
5 Center_0 West Web Hosting 65.743955 40.348164
6 Center_1 East Cloud Storage 58.068076 26.131858
7 Center_1 East Data Processing 52.559742 35.977840
8 Center_1 East Web Hosting 53.608460 27.885183
9 Center_1 West Cloud Storage 67.406100 32.278916
10 Center_1 West Data Processing 40.087963 35.195383
11 Center_1 West Web Hosting 49.162421 35.982805
12 NaN NaN Cloud Storage 58.605781 32.230542
13 NaN NaN Data Processing 47.754155 29.978704
14 NaN NaN Web Hosting 54.883262 33.790533
15 NaN NaN NaN 54.155928 32.123506
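
If you want the rows in roughly the same order as the DuckDB output above, a sort on the grouping columns after the concatenation does the trick (the placement of the NaN rows may differ slightly from DuckDB’s NULL ordering):

(
    pd_concat(results, ignore_index=True)
    .sort_values(
        ['center_id', 'location', 'service_type'],
        na_position='last',
        ignore_index=True,
    )
)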

Roll Up#

A roll up is a convenient way to express hierarchical grouping sets. In this case, instead of manually specifying our sets, we can generate them. The idea is that if we want aggregations at progressively coarser hierarchical levels, we can express them like so:

rollup('center_id', 'location', 'service_type')

# produces the following groupings
['center_id', 'location', 'service_type']
['location', 'service_type']
['service_type']
[]

DuckDB - Roll Up#

With DuckDB, we can use the built-in rollup syntax to express this idea:

import duckdb

duckdb.sql('''
    select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
    from pl_df
    group by rollup (center_id, location, service_type)
    order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │  service_type   │   avg(cpu_usage)   │   avg(mem_usage)   │
│  varchar  │ varchar  │     varchar     │       double       │       double       │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0  │ East     │ Cloud Storage   │ 48.573814147140006 │  34.59012485147718 │
│ Center_0  │ East     │ Data Processing │  38.89813718396443 │ 31.946375637497113 │
│ Center_0  │ East     │ Web Hosting     │ 49.510621712697535 │  31.49215279790748 │
│ Center_0  │ East     │ NULL            │  47.19697970618814 │ 33.030178971279625 │
│ Center_0  │ West     │ Cloud Storage   │  59.67797041313039 │  39.92657027278433 │
│ Center_0  │ West     │ Data Processing │    54.954340526705 │ 21.972751748733323 │
│ Center_0  │ West     │ Web Hosting     │  65.74395477203684 │  40.34816417380653 │
│ Center_0  │ West     │ NULL            │  59.77552339385185 │  32.75258291517836 │
│ Center_0  │ NULL     │ NULL            │ 54.915631514527234 │  32.85983593685385 │
│ Center_1  │ East     │ Cloud Storage   │  58.06807588431277 │ 26.131857675662502 │
│ Center_1  │ East     │ Data Processing │ 52.559742275163295 │  35.97783969816975 │
│ Center_1  │ East     │ Web Hosting     │  53.60845973063669 │ 27.885182582416054 │
│ Center_1  │ East     │ NULL            │   55.3324528223062 │  28.57887496447768 │
│ Center_1  │ West     │ Cloud Storage   │  67.40609965208978 │  32.27891608705559 │
│ Center_1  │ West     │ Data Processing │  40.08796317401972 │ 35.195382508091186 │
│ Center_1  │ West     │ Web Hosting     │  49.16242082013778 │  35.98280493510762 │
│ Center_1  │ West     │ NULL            │  51.78558271393731 │ 34.511046938585025 │
│ Center_1  │ NULL     │ NULL            │  53.55901776812175 │ 31.544960951531376 │
│ NULL      │ NULL     │ NULL            │ 54.155927816540164 │  32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 19 rows                                                                5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘

But now let’s move on to our implementation in tools that do not have this functionality built in. We’ll first need to generate these groupings at the Python level so that both Polars and pandas can operate along these groups.

Roll Up Formulation#

I always say that good use of third-party Python packages (Polars, pandas, Matplotlib, etc.) requires a good understanding and application of Python itself. In this case, we will need to use Python to create our rollup groups, and thankfully itertools makes this quite straightforward.

from itertools import islice

def rollup(*items):
    "produce shrinking subsets of items"
    return (
        # drop the i leftmost columns at each step, ending with the empty grouping
        [*islice(items, i, None)] for i in range(len(items) + 1)
    )

for gs in rollup('center_id', 'location', 'service_type'):
    print(gs)
['center_id', 'location', 'service_type']
['location', 'service_type']
['service_type']
[]
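
One caveat before we apply this: SQL’s ROLLUP, including DuckDB’s, keeps the leading columns and drops from the right, producing (center_id, location, service_type), (center_id, location), (center_id), and (); you can see this in the DuckDB output above, where only trailing columns become NULL. The generator above instead peels columns off from the left and keeps suffixes, so the Polars and pandas results below will not line up row-for-row with the DuckDB rollup. To mirror DuckDB exactly, either pass the columns to rollup in reverse order or keep prefixes instead of suffixes; here is a minimal sketch of a prefix-keeping variant (the name rollup_prefixes is mine):

from itertools import islice

def rollup_prefixes(*items):
    "SQL-style ROLLUP: keep leading columns, dropping one from the right each step"
    return (
        [*islice(items, None, i)] for i in range(len(items), -1, -1)
    )

rollup_prefixes('center_id', 'location', 'service_type')

# produces the following groupings
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id']
[]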

Now that we have a generator to create these groups, let’s apply it to our DataFrames to see it in action (I’ll stick with the suffix formulation here). The approach is nearly identical to our grouping sets implementation.

Polars - Roll Up#

from polars import col, concat as pl_concat, lit

aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]

frames = []
ldf = pl_df.lazy()
for gs in rollup('center_id', 'location', 'service_type'):
    if not gs:
        query = ldf.select(aggfuncs)
    else:
        query = ldf.group_by(gs).agg(aggfuncs)
    frames.append(
        query.with_columns(groupings=lit(gs))
    )
    
pl_concat(frames, how='diagonal').collect()
shape: (22, 6)
┌────────────┬──────────┬───────────────────┬───────────┬───────────┬───────────────────────────────────────────┐
│ center_id  ┆ location ┆ service_type      ┆ cpu_usage ┆ mem_usage ┆ groupings                                 │
│ ---        ┆ ---      ┆ ---               ┆ ---       ┆ ---       ┆ ---                                       │
│ str        ┆ str      ┆ str               ┆ f64       ┆ f64       ┆ list[str]                                 │
╞════════════╪══════════╪═══════════════════╪═══════════╪═══════════╪═══════════════════════════════════════════╡
│ "Center_0" ┆ "East"   ┆ "Web Hosting"     ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "service_type"] │
│ "Center_0" ┆ "East"   ┆ "Cloud Storage"   ┆ 48.573814 ┆ 34.590125 ┆ ["center_id", "location", "service_type"] │
│ "Center_1" ┆ "East"   ┆ "Data Processing" ┆ 52.559742 ┆ 35.97784  ┆ ["center_id", "location", "service_type"] │
│ "Center_0" ┆ "West"   ┆ "Data Processing" ┆ 54.954341 ┆ 21.972752 ┆ ["center_id", "location", "service_type"] │
│ "Center_1" ┆ "East"   ┆ "Cloud Storage"   ┆ 58.068076 ┆ 26.131858 ┆ ["center_id", "location", "service_type"] │
│ …          ┆ …        ┆ …                 ┆ …         ┆ …         ┆ …                                         │
│ null       ┆ "East"   ┆ "Web Hosting"     ┆ 52.162164 ┆ 29.158231 ┆ ["location", "service_type"]              │
│ null       ┆ null     ┆ "Cloud Storage"   ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"]                          │
│ null       ┆ null     ┆ "Data Processing" ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"]                          │
│ null       ┆ null     ┆ "Web Hosting"     ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"]                          │
│ null       ┆ null     ┆ null              ┆ 54.155928 ┆ 32.123506 ┆ []                                        │
└────────────┴──────────┴───────────────────┴───────────┴───────────┴───────────────────────────────────────────┘
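
The groupings column also gives us an easy handle for pulling a single level of the rollup back out. For example, with recent Polars versions you can filter on the length of that list column; a small usage sketch:

result = pl_concat(frames, how='diagonal').collect()

# keep only the rows that came from the two-column ('location', 'service_type') grouping
result.filter(col('groupings').list.len() == 2)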

pandas - Roll Up#

from pandas import concat, Series as pd_Series

results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
for gs in rollup('center_id', 'location', 'service_type'):
    if not gs:
        res = pd_df.agg(aggfuncs).to_frame().T
    else:
        res = pd_df.groupby(list(gs), as_index=False).agg(aggfuncs)
    results.append(
        res.assign(
            groupings=lambda d: pd_Series([gs] * len(d), index=d.index)
        )
    )
    
concat(results, ignore_index=True)
center_id location service_type cpu_usage mem_usage groupings
0 Center_0 East Cloud Storage 48.573814 34.590125 [center_id, location, service_type]
1 Center_0 East Data Processing 38.898137 31.946376 [center_id, location, service_type]
2 Center_0 East Web Hosting 49.510622 31.492153 [center_id, location, service_type]
3 Center_0 West Cloud Storage 59.677970 39.926570 [center_id, location, service_type]
4 Center_0 West Data Processing 54.954341 21.972752 [center_id, location, service_type]
5 Center_0 West Web Hosting 65.743955 40.348164 [center_id, location, service_type]
6 Center_1 East Cloud Storage 58.068076 26.131858 [center_id, location, service_type]
7 Center_1 East Data Processing 52.559742 35.977840 [center_id, location, service_type]
8 Center_1 East Web Hosting 53.608460 27.885183 [center_id, location, service_type]
9 Center_1 West Cloud Storage 67.406100 32.278916 [center_id, location, service_type]
10 Center_1 West Data Processing 40.087963 35.195383 [center_id, location, service_type]
11 Center_1 West Web Hosting 49.162421 35.982805 [center_id, location, service_type]
12 NaN East Cloud Storage 54.270371 29.515165 [location, service_type]
13 NaN East Data Processing 47.436640 34.466041 [location, service_type]
14 NaN East Web Hosting 52.162164 29.158231 [location, service_type]
15 NaN West Cloud Storage 64.025043 35.624765 [location, service_type]
16 NaN West Data Processing 47.875113 28.269243 [location, service_type]
17 NaN West Web Hosting 57.453188 38.165485 [location, service_type]
18 NaN NaN Cloud Storage 58.605781 32.230542 [service_type]
19 NaN NaN Data Processing 47.754155 29.978704 [service_type]
20 NaN NaN Web Hosting 54.883262 33.790533 [service_type]
21 NaN NaN NaN 54.155928 32.123506 []
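
The same trick works in pandas, since the groupings column holds plain Python lists; a small usage sketch:

out = concat(results, ignore_index=True)

# keep only the rows that came from the two-column ('location', 'service_type') grouping
out[out['groupings'].map(len) == 2]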

Cube#

For our final recreation, we are going to implement the “group by cube” syntax that DuckDB offers. Let’s take a look at the result set.

DuckDB - Cube#

import duckdb


duckdb.sql('''
    select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
    from pl_df
    group by cube (center_id, location, service_type)
    order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │  service_type   │   avg(cpu_usage)   │   avg(mem_usage)   │
│  varchar  │ varchar  │     varchar     │       double       │       double       │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0  │ East     │ Cloud Storage   │ 48.573814147140006 │  34.59012485147718 │
│ Center_0  │ East     │ Data Processing │  38.89813718396443 │ 31.946375637497113 │
│ Center_0  │ East     │ Web Hosting     │ 49.510621712697535 │  31.49215279790748 │
│ Center_0  │ East     │ NULL            │  47.19697970618814 │ 33.030178971279625 │
│ Center_0  │ West     │ Cloud Storage   │  59.67797041313039 │  39.92657027278433 │
│ Center_0  │ West     │ Data Processing │    54.954340526705 │ 21.972751748733323 │
│ Center_0  │ West     │ Web Hosting     │  65.74395477203684 │  40.34816417380653 │
│ Center_0  │ West     │ NULL            │  59.77552339385185 │  32.75258291517836 │
│ Center_0  │ NULL     │ Cloud Storage   │ 53.755753737935514 │  37.08046604808719 │
│ Center_0  │ NULL     │ Data Processing │ 51.513725524689164 │ 24.109956867754132 │
│  ·        │  ·       │      ·          │          ·         │          ·         │
│  ·        │  ·       │      ·          │          ·         │          ·         │
│  ·        │  ·       │      ·          │          ·         │          ·         │
│ NULL      │ East     │ Web Hosting     │  52.16216395959934 │ 29.158230893765968 │
│ NULL      │ East     │ NULL            │ 52.259051867328274 │  30.26047870038065 │
│ NULL      │ West     │ Cloud Storage   │  64.02504311004503 │ 35.624764793311925 │
│ NULL      │ West     │ Data Processing │  47.87511321590248 │  28.26924258652279 │
│ NULL      │ West     │ Web Hosting     │  57.45318779608731 │  38.16548455445707 │
│ NULL      │ West     │ NULL            │  55.70791722953172 │  33.64780096345814 │
│ NULL      │ NULL     │ Cloud Storage   │  58.60578093193317 │  32.23054243368773 │
│ NULL      │ NULL     │ Data Processing │  47.75415518833317 │ 29.978704128286843 │
│ NULL      │ NULL     │ Web Hosting     │   54.8832619326503 │ 33.790532776407105 │
│ NULL      │ NULL     │ NULL            │ 54.155927816540164 │  32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 36 rows (20 shown)                                                     5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘

Cube Formulation#

In terms of combinatorics, “cube” is effectively a reversed powerset of the input elements, where a powerset is the collection of all possible combinations, of every size, drawn from a given set of inputs. The following pseudocode snippet displays the expected inputs and outputs for this pattern.

cube('center_id', 'location', 'service_type')

# produces the following groupings
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id', 'service_type']
['location', 'service_type']
['center_id']
['location']
['service_type']
[]

The above pattern is incredibly similar to the powerset recipe in the itertools documentation, with the caveat that cube is reversed so that it begins with the largest group instead of the smallest.

Our rewrite in Python would look like:

from itertools import combinations

def cube(*items):
    """reversed version of the itertools powerset recipe"""
    # count down from the full set of columns to the empty grouping
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]

for gs in cube('center_id', 'location', 'service_type'):
    print(gs)
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id', 'service_type']
['location', 'service_type']
['center_id']
['location']
['service_type']
[]

Polars - Cube#

from polars import col, concat as pl_concat, lit

aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]

frames = []
ldf = pl_df.lazy()
for gs in cube('center_id', 'location', 'service_type'):
    if not gs:
        query = ldf.select(aggfuncs)
    else:
        query = ldf.group_by(gs).agg(aggfuncs)
    frames.append(
        query.with_columns(groupings=lit(gs))
    )
    
pl_concat(frames, how='diagonal').collect()
shape: (36, 6)
┌────────────┬──────────┬───────────────────┬───────────┬───────────┬───────────────────────────────────────────┐
│ center_id  ┆ location ┆ service_type      ┆ cpu_usage ┆ mem_usage ┆ groupings                                 │
│ ---        ┆ ---      ┆ ---               ┆ ---       ┆ ---       ┆ ---                                       │
│ str        ┆ str      ┆ str               ┆ f64       ┆ f64       ┆ list[str]                                 │
╞════════════╪══════════╪═══════════════════╪═══════════╪═══════════╪═══════════════════════════════════════════╡
│ "Center_1" ┆ "West"   ┆ "Web Hosting"     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "service_type"] │
│ "Center_1" ┆ "East"   ┆ "Cloud Storage"   ┆ 58.068076 ┆ 26.131858 ┆ ["center_id", "location", "service_type"] │
│ "Center_1" ┆ "West"   ┆ "Data Processing" ┆ 40.087963 ┆ 35.195383 ┆ ["center_id", "location", "service_type"] │
│ "Center_1" ┆ "East"   ┆ "Data Processing" ┆ 52.559742 ┆ 35.97784  ┆ ["center_id", "location", "service_type"] │
│ "Center_0" ┆ "West"   ┆ "Cloud Storage"   ┆ 59.67797  ┆ 39.92657  ┆ ["center_id", "location", "service_type"] │
│ …          ┆ …        ┆ …                 ┆ …         ┆ …         ┆ …                                         │
│ null       ┆ "East"   ┆ null              ┆ 52.259052 ┆ 30.260479 ┆ ["location"]                              │
│ null       ┆ null     ┆ "Data Processing" ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"]                          │
│ null       ┆ null     ┆ "Cloud Storage"   ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"]                          │
│ null       ┆ null     ┆ "Web Hosting"     ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"]                          │
│ null       ┆ null     ┆ null              ┆ 54.155928 ┆ 32.123506 ┆ []                                        │
└────────────┴──────────┴───────────────────┴───────────┴───────────┴───────────────────────────────────────────┘

pandas - Cube#

from pandas import concat, Series as pd_Series

results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
for gs in cube('center_id', 'location', 'service_type'):
    if not gs:
        res = pd_df.agg(aggfuncs).to_frame().T
    else:
        res = pd_df.groupby(list(gs), as_index=False).agg(aggfuncs)
    results.append(
        res.assign(
            groupings=lambda d: pd_Series([gs] * len(d), index=d.index)
        )
    )
    
concat(results, ignore_index=True).fillna('') # fillna to view output easily
center_id location service_type cpu_usage mem_usage groupings
0 Center_0 East Cloud Storage 48.573814 34.590125 [center_id, location, service_type]
1 Center_0 East Data Processing 38.898137 31.946376 [center_id, location, service_type]
2 Center_0 East Web Hosting 49.510622 31.492153 [center_id, location, service_type]
3 Center_0 West Cloud Storage 59.677970 39.926570 [center_id, location, service_type]
4 Center_0 West Data Processing 54.954341 21.972752 [center_id, location, service_type]
5 Center_0 West Web Hosting 65.743955 40.348164 [center_id, location, service_type]
6 Center_1 East Cloud Storage 58.068076 26.131858 [center_id, location, service_type]
7 Center_1 East Data Processing 52.559742 35.977840 [center_id, location, service_type]
8 Center_1 East Web Hosting 53.608460 27.885183 [center_id, location, service_type]
9 Center_1 West Cloud Storage 67.406100 32.278916 [center_id, location, service_type]
10 Center_1 West Data Processing 40.087963 35.195383 [center_id, location, service_type]
11 Center_1 West Web Hosting 49.162421 35.982805 [center_id, location, service_type]
12 Center_0 East 47.196980 33.030179 [center_id, location]
13 Center_0 West 59.775523 32.752583 [center_id, location]
14 Center_1 East 55.332453 28.578875 [center_id, location]
15 Center_1 West 51.785583 34.511047 [center_id, location]
16 Center_0 Cloud Storage 53.755754 37.080466 [center_id, service_type]
17 Center_0 Data Processing 51.513726 24.109957 [center_id, service_type]
18 Center_0 Web Hosting 59.250622 36.805760 [center_id, service_type]
19 Center_1 Cloud Storage 62.070086 28.766311 [center_id, service_type]
20 Center_1 Data Processing 44.245223 35.456202 [center_id, service_type]
21 Center_1 Web Hosting 51.607742 31.529113 [center_id, service_type]
22 East Cloud Storage 54.270371 29.515165 [location, service_type]
23 East Data Processing 47.436640 34.466041 [location, service_type]
24 East Web Hosting 52.162164 29.158231 [location, service_type]
25 West Cloud Storage 64.025043 35.624765 [location, service_type]
26 West Data Processing 47.875113 28.269243 [location, service_type]
27 West Web Hosting 57.453188 38.165485 [location, service_type]
28 Center_0 54.915632 32.859836 [center_id]
29 Center_1 53.559018 31.544961 [center_id]
30 East 52.259052 30.260479 [location]
31 West 55.707917 33.647801 [location]
32 Cloud Storage 58.605781 32.230542 [service_type]
33 Data Processing 47.754155 29.978704 [service_type]
34 Web Hosting 54.883262 33.790533 [service_type]
35 54.155928 32.123506 []
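
One thing to keep in mind before reaching for cube: the number of grouping sets, and therefore the number of separate group by operations we run, grows much faster than it does for rollup. A quick check with the generators we wrote above:

# cube produces 2**n grouping sets; rollup produces n + 1
print(len(list(cube('center_id', 'location', 'service_type'))))    # 8
print(len(list(rollup('center_id', 'location', 'service_type'))))  # 4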

Wrap-Up#

There you have it: replicating DuckDB’s Group By “Grouping Sets”, “Rollup”, and “Cube” functionality! While both pandas and Polars require a bit more manual work, it’s entirely feasible with the right approach. A bit of pure Python understanding takes us a long way, even for analytical tasks where we lean heavily on third-party packages.

What did you think about these approaches? Let me know on the DUTC Discord. Talk to you all next week!