Tabular Group By Sets
Hi all, welcome back to Cameron's Corner! This week, I want to replicate some convenient analytical functionality from DuckDB in both pandas and Polars.
Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
Back to the topic at hand!
Today, I set out to replicate DuckDB's "Grouping Sets" functionality in Polars and pandas: an analytical feature that lets us aggregate across pre-defined, nested, or permuted sets of grouping columns.
While this convenient functionality exists in DuckDB, it's actually not built into either pandas or Polars (there's even an open issue in Polars requesting this functionality). I want to discuss why understanding and using Python as a language is important, even when the bulk of our work is analytical.
Data
Let’s start by modeling some data that have a few categorical (grouping) columns for us to compare across.
from pandas import DataFrame, Timestamp, to_timedelta
from numpy.random import default_rng
from polars import from_pandas
rng = default_rng(0)
center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]
pd_df = DataFrame({
"center_id": rng.choice(center_ids, size=(size := 100)),
"location": rng.choice(locations, size=size),
"service_type": rng.choice(service_types, size=size),
"timestamp": Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='D'),
"cpu_usage": rng.uniform(0, 100, size=size),
"mem_usage": rng.uniform(0, 64, size=size),
})
pl_df = from_pandas(pd_df)
pd_df.head()
| | center_id | location | service_type | timestamp | cpu_usage | mem_usage |
---|---|---|---|---|---|---|
0 | Center_1 | East | Web Hosting | 2023-11-22 15:50:03.116313 | 31.968164 | 57.395279 |
1 | Center_1 | West | Data Processing | 2024-12-03 15:50:03.116313 | 18.750772 | 37.332483 |
2 | Center_1 | East | Web Hosting | 2024-01-01 15:50:03.116313 | 67.252663 | 2.573966 |
3 | Center_0 | East | Web Hosting | 2021-05-17 15:50:03.116313 | 19.510740 | 45.535157 |
4 | Center_0 | West | Cloud Storage | 2023-12-05 15:50:03.116313 | 57.768789 | 36.417655 |
Grouping Sets
In this first exercise, we want to understand the average 'cpu_usage' and 'mem_usage' for the following cross sections:
within each combination of 'center_id', 'location', 'service_type'
within 'service_type', ignoring 'center_id' and 'location'
across all centers, regardless of any other categories.
DuckDB - Grouping Sets
DuckDB provides an incredibly convenient syntax, group by grouping sets, which lets us specify each grouping set directly in the query.
import duckdb
duckdb.sql('''
select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
from pl_df
group by grouping sets (
(center_id, location, service_type),
(service_type),
()
)
order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │ service_type │ avg(cpu_usage) │ avg(mem_usage) │
│ varchar │ varchar │ varchar │ double │ double │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0 │ East │ Cloud Storage │ 48.573814147140006 │ 34.59012485147718 │
│ Center_0 │ East │ Data Processing │ 38.89813718396443 │ 31.946375637497113 │
│ Center_0 │ East │ Web Hosting │ 49.510621712697535 │ 31.49215279790748 │
│ Center_0 │ West │ Cloud Storage │ 59.67797041313039 │ 39.92657027278433 │
│ Center_0 │ West │ Data Processing │ 54.954340526705 │ 21.972751748733323 │
│ Center_0 │ West │ Web Hosting │ 65.74395477203684 │ 40.34816417380653 │
│ Center_1 │ East │ Cloud Storage │ 58.06807588431277 │ 26.131857675662502 │
│ Center_1 │ East │ Data Processing │ 52.559742275163295 │ 35.97783969816975 │
│ Center_1 │ East │ Web Hosting │ 53.60845973063669 │ 27.885182582416054 │
│ Center_1 │ West │ Cloud Storage │ 67.40609965208978 │ 32.27891608705559 │
│ Center_1 │ West │ Data Processing │ 40.08796317401972 │ 35.195382508091186 │
│ Center_1 │ West │ Web Hosting │ 49.16242082013778 │ 35.98280493510762 │
│ NULL │ NULL │ Cloud Storage │ 58.60578093193317 │ 32.23054243368773 │
│ NULL │ NULL │ Data Processing │ 47.75415518833317 │ 29.978704128286843 │
│ NULL │ NULL │ Web Hosting │ 54.8832619326503 │ 33.790532776407105 │
│ NULL │ NULL │ NULL │ 54.155927816540164 │ 32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 16 rows 5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘
Polars - Grouping Sets
In contrast, Polars has no built-in way to do this. We’ll need to build our query piece by piece and concatenate the results, and we can do so with a polars.LazyFrame so that the query engine can make as many optimizations as possible.
This code implements the same approach that DuckDB takes: computing each aggregation separately and vertically concatenating them into a single result set.
from polars import col, concat as pl_concat
aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
groupings = [
['center_id', 'location', 'service_type'],
['service_type'],
[],
]
frames = []
ldf = pl_df.lazy()
for gs in groupings:
if not gs:
query = ldf.select(aggfuncs)
else:
query = ldf.group_by(gs).agg(aggfuncs)
frames.append(query)
pl_concat(frames, how='diagonal').collect()
center_id | location | service_type | cpu_usage | mem_usage |
---|---|---|---|---|
str | str | str | f64 | f64 |
"Center_1" | "West" | "Data Processing" | 40.087963 | 35.195383 |
"Center_1" | "East" | "Data Processing" | 52.559742 | 35.97784 |
"Center_0" | "West" | "Cloud Storage" | 59.67797 | 39.92657 |
"Center_0" | "West" | "Data Processing" | 54.954341 | 21.972752 |
"Center_0" | "East" | "Cloud Storage" | 48.573814 | 34.590125 |
… | … | … | … | … |
"Center_1" | "East" | "Web Hosting" | 53.60846 | 27.885183 |
null | null | "Web Hosting" | 54.883262 | 33.790533 |
null | null | "Cloud Storage" | 58.605781 | 32.230542 |
null | null | "Data Processing" | 47.754155 | 29.978704 |
null | null | null | 54.155928 | 32.123506 |
pandas - Grouping Sets
Finally, we can reproduce this same analysis in pandas. I find both the approach and the syntax to be quite similar to those of Polars. The only downside is that we need to eagerly compute each of these aggregations before concatenating them, whereas Polars can evaluate its lazy queries in parallel.
from pandas import concat as pd_concat
results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
groupings = [
['center_id', 'location', 'service_type'],
['service_type'],
[],
]
for gs in groupings:
if not gs:
res = pd_df.agg(aggfuncs).to_frame().T
else:
res = pd_df.groupby(gs, as_index=False).agg(aggfuncs)
results.append(res)
pd_concat(results, ignore_index=True)
| | center_id | location | service_type | cpu_usage | mem_usage |
---|---|---|---|---|---|
0 | Center_0 | East | Cloud Storage | 48.573814 | 34.590125 |
1 | Center_0 | East | Data Processing | 38.898137 | 31.946376 |
2 | Center_0 | East | Web Hosting | 49.510622 | 31.492153 |
3 | Center_0 | West | Cloud Storage | 59.677970 | 39.926570 |
4 | Center_0 | West | Data Processing | 54.954341 | 21.972752 |
5 | Center_0 | West | Web Hosting | 65.743955 | 40.348164 |
6 | Center_1 | East | Cloud Storage | 58.068076 | 26.131858 |
7 | Center_1 | East | Data Processing | 52.559742 | 35.977840 |
8 | Center_1 | East | Web Hosting | 53.608460 | 27.885183 |
9 | Center_1 | West | Cloud Storage | 67.406100 | 32.278916 |
10 | Center_1 | West | Data Processing | 40.087963 | 35.195383 |
11 | Center_1 | West | Web Hosting | 49.162421 | 35.982805 |
12 | NaN | NaN | Cloud Storage | 58.605781 | 32.230542 |
13 | NaN | NaN | Data Processing | 47.754155 | 29.978704 |
14 | NaN | NaN | Web Hosting | 54.883262 | 33.790533 |
15 | NaN | NaN | NaN | 54.155928 | 32.123506 |
Roll Up
A roll up is a convenient way to express hierarchical grouping sets. In this case, instead of manually specifying our sets, we can generate them programmatically. The idea is that if we want aggregations at every level of a hierarchy, we repeatedly drop the rightmost (most specific) column, like so:
rollup('center_id', 'location', 'service_type')
# produces the following groupings
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id']
[]
DuckDB - Roll Up
With DuckDB, we can use the built-in rollup syntax to express this idea:
import duckdb
duckdb.sql('''
select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
from pl_df
group by rollup (center_id, location, service_type)
order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │ service_type │ avg(cpu_usage) │ avg(mem_usage) │
│ varchar │ varchar │ varchar │ double │ double │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0 │ East │ Cloud Storage │ 48.573814147140006 │ 34.59012485147718 │
│ Center_0 │ East │ Data Processing │ 38.89813718396443 │ 31.946375637497113 │
│ Center_0 │ East │ Web Hosting │ 49.510621712697535 │ 31.49215279790748 │
│ Center_0 │ East │ NULL │ 47.19697970618814 │ 33.030178971279625 │
│ Center_0 │ West │ Cloud Storage │ 59.67797041313039 │ 39.92657027278433 │
│ Center_0 │ West │ Data Processing │ 54.954340526705 │ 21.972751748733323 │
│ Center_0 │ West │ Web Hosting │ 65.74395477203684 │ 40.34816417380653 │
│ Center_0 │ West │ NULL │ 59.77552339385185 │ 32.75258291517836 │
│ Center_0 │ NULL │ NULL │ 54.915631514527234 │ 32.85983593685385 │
│ Center_1 │ East │ Cloud Storage │ 58.06807588431277 │ 26.131857675662502 │
│ Center_1 │ East │ Data Processing │ 52.559742275163295 │ 35.97783969816975 │
│ Center_1 │ East │ Web Hosting │ 53.60845973063669 │ 27.885182582416054 │
│ Center_1 │ East │ NULL │ 55.3324528223062 │ 28.57887496447768 │
│ Center_1 │ West │ Cloud Storage │ 67.40609965208978 │ 32.27891608705559 │
│ Center_1 │ West │ Data Processing │ 40.08796317401972 │ 35.195382508091186 │
│ Center_1 │ West │ Web Hosting │ 49.16242082013778 │ 35.98280493510762 │
│ Center_1 │ West │ NULL │ 51.78558271393731 │ 34.511046938585025 │
│ Center_1 │ NULL │ NULL │ 53.55901776812175 │ 31.544960951531376 │
│ NULL │ NULL │ NULL │ 54.155927816540164 │ 32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 19 rows 5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘
But now let's move on to our implementations in tools that do not have this functionality built in. We'll first need to generate these groupings at the Python level so that both Polars and pandas can operate along these groups.
Roll Up Formulation
I always say that good usage of third-party Python packages (Polars, pandas, Matplotlib, etc.) requires a good understanding and application of Python itself. In this case, we need Python to create our rollup groups, and thankfully, itertools makes this quite straightforward.
from itertools import islice
def rollup(*items):
    "produce shrinking prefixes of items, from longest to shortest"
    return (
        [*islice(items, i)] for i in range(len(items), -1, -1)
    )
for gs in rollup('center_id', 'location', 'service_type'):
print(gs)
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id']
[]
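If the islice spelling reads as opaque, the same generator can be written with plain slicing. This is just an equivalent sketch; rollup_slices is an illustrative name, not something we use below:
def rollup_slices(*items):
    "all prefixes of items, from longest to shortest (same output as rollup)"
    return (list(items[:i]) for i in range(len(items), -1, -1))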
Now that we have a generator to create these groups, let's apply it to our DataFrames to see it in action. The approach is nearly identical to our grouping sets implementation.
Polars - Roll Up
from polars import col, concat as pl_concat, lit
aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
frames = []
ldf = pl_df.lazy()
for gs in rollup('center_id', 'location', 'service_type'):
if not gs:
query = ldf.select(aggfuncs)
else:
query = ldf.group_by(gs).agg(aggfuncs)
frames.append(
query.with_columns(groupings=lit(gs))
)
pl_concat(frames, how='diagonal_relaxed').collect()
center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|
str | str | str | f64 | f64 | list[str] |
"Center_0" | "West" | "Cloud Storage" | 59.67797 | 39.92657 | ["center_id", "location", "service_type"] |
"Center_0" | "West" | "Data Processing" | 54.954341 | 21.972752 | ["center_id", "location", "service_type"] |
"Center_1" | "East" | "Cloud Storage" | 58.068076 | 26.131858 | ["center_id", "location", "service_type"] |
"Center_1" | "East" | "Web Hosting" | 53.60846 | 27.885183 | ["center_id", "location", "service_type"] |
"Center_1" | "West" | "Data Processing" | 40.087963 | 35.195383 | ["center_id", "location", "service_type"] |
… | … | … | … | … | … |
null | "East" | "Cloud Storage" | 54.270371 | 29.515165 | ["location", "service_type"] |
null | null | "Web Hosting" | 54.883262 | 33.790533 | ["service_type"] |
null | null | "Cloud Storage" | 58.605781 | 32.230542 | ["service_type"] |
null | null | "Data Processing" | 47.754155 | 29.978704 | ["service_type"] |
null | null | null | 54.155928 | 32.123506 | [] |
pandas - Roll Up
from pandas import concat, Series as pd_Series
results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
for gs in rollup('center_id', 'location', 'service_type'):
if not gs:
res = pd_df.agg(aggfuncs).to_frame().T
else:
res = pd_df.groupby(list(gs), as_index=False).agg(aggfuncs)
results.append(
res.assign(
groupings=lambda d: pd_Series([gs] * len(d), index=d.index)
)
)
concat(results, ignore_index=True)
| | center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|---|
0 | Center_0 | East | Cloud Storage | 48.573814 | 34.590125 | [center_id, location, service_type] |
1 | Center_0 | East | Data Processing | 38.898137 | 31.946376 | [center_id, location, service_type] |
2 | Center_0 | East | Web Hosting | 49.510622 | 31.492153 | [center_id, location, service_type] |
3 | Center_0 | West | Cloud Storage | 59.677970 | 39.926570 | [center_id, location, service_type] |
4 | Center_0 | West | Data Processing | 54.954341 | 21.972752 | [center_id, location, service_type] |
5 | Center_0 | West | Web Hosting | 65.743955 | 40.348164 | [center_id, location, service_type] |
6 | Center_1 | East | Cloud Storage | 58.068076 | 26.131858 | [center_id, location, service_type] |
7 | Center_1 | East | Data Processing | 52.559742 | 35.977840 | [center_id, location, service_type] |
8 | Center_1 | East | Web Hosting | 53.608460 | 27.885183 | [center_id, location, service_type] |
9 | Center_1 | West | Cloud Storage | 67.406100 | 32.278916 | [center_id, location, service_type] |
10 | Center_1 | West | Data Processing | 40.087963 | 35.195383 | [center_id, location, service_type] |
11 | Center_1 | West | Web Hosting | 49.162421 | 35.982805 | [center_id, location, service_type] |
12 | Center_0 | East | NaN | 47.196980 | 33.030179 | [center_id, location]
13 | Center_0 | West | NaN | 59.775523 | 32.752583 | [center_id, location]
14 | Center_1 | East | NaN | 55.332453 | 28.578875 | [center_id, location]
15 | Center_1 | West | NaN | 51.785583 | 34.511047 | [center_id, location]
16 | Center_0 | NaN | NaN | 54.915632 | 32.859836 | [center_id]
17 | Center_1 | NaN | NaN | 53.559018 | 31.544961 | [center_id]
18 | NaN | NaN | NaN | 54.155928 | 32.123506 | []
Cube
For our final recreation, we are going to implement the "group by cube" syntax that DuckDB offers. Let's take a look at the result set.
DuckDB - Cube
import duckdb
duckdb.sql('''
select center_id, location, service_type, avg(cpu_usage), avg(mem_usage)
from pl_df
group by cube (center_id, location, service_type)
order by (center_id, location, service_type);
''')
┌───────────┬──────────┬─────────────────┬────────────────────┬────────────────────┐
│ center_id │ location │ service_type │ avg(cpu_usage) │ avg(mem_usage) │
│ varchar │ varchar │ varchar │ double │ double │
├───────────┼──────────┼─────────────────┼────────────────────┼────────────────────┤
│ Center_0 │ East │ Cloud Storage │ 48.573814147140006 │ 34.59012485147718 │
│ Center_0 │ East │ Data Processing │ 38.89813718396443 │ 31.946375637497113 │
│ Center_0 │ East │ Web Hosting │ 49.510621712697535 │ 31.49215279790748 │
│ Center_0 │ East │ NULL │ 47.19697970618814 │ 33.030178971279625 │
│ Center_0 │ West │ Cloud Storage │ 59.67797041313039 │ 39.92657027278433 │
│ Center_0 │ West │ Data Processing │ 54.954340526705 │ 21.972751748733323 │
│ Center_0 │ West │ Web Hosting │ 65.74395477203684 │ 40.34816417380653 │
│ Center_0 │ West │ NULL │ 59.77552339385185 │ 32.75258291517836 │
│ Center_0 │ NULL │ Cloud Storage │ 53.755753737935514 │ 37.08046604808719 │
│ Center_0 │ NULL │ Data Processing │ 51.513725524689164 │ 24.109956867754132 │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ NULL │ East │ Web Hosting │ 52.16216395959934 │ 29.158230893765968 │
│ NULL │ East │ NULL │ 52.259051867328274 │ 30.26047870038065 │
│ NULL │ West │ Cloud Storage │ 64.02504311004503 │ 35.624764793311925 │
│ NULL │ West │ Data Processing │ 47.87511321590248 │ 28.26924258652279 │
│ NULL │ West │ Web Hosting │ 57.45318779608731 │ 38.16548455445707 │
│ NULL │ West │ NULL │ 55.70791722953172 │ 33.64780096345814 │
│ NULL │ NULL │ Cloud Storage │ 58.60578093193317 │ 32.23054243368773 │
│ NULL │ NULL │ Data Processing │ 47.75415518833317 │ 29.978704128286843 │
│ NULL │ NULL │ Web Hosting │ 54.8832619326503 │ 33.790532776407105 │
│ NULL │ NULL │ NULL │ 54.155927816540164 │ 32.12350594507323 │
├───────────┴──────────┴─────────────────┴────────────────────┴────────────────────┤
│ 36 rows (20 shown) 5 columns │
└──────────────────────────────────────────────────────────────────────────────────┘
Cube Formulation
In terms of combinatorics, "cube" is effectively a reversed powerset of the input elements, where a powerset is every possible combination, of every size, drawn from a given set of inputs. The following pseudocode snippet shows the expected inputs and outputs for this pattern.
cube('center_id', 'location', 'service_type')
# produces the following groupings
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id', 'service_type']
['location', 'service_type']
['center_id']
['location']
['service_type']
[]
The above pattern is incredibly similar to the powerset recipe in itertools with the caveat that cube is reversed so that it begins with the largest group instead of the smallest.
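For reference, the powerset recipe from the itertools documentation builds the combinations from smallest to largest:
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) → () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))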
Our rewrite in Python would look like:
from itertools import combinations
def cube(*items):
    """reversed version of the itertools powerset recipe"""
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]
for gs in cube('center_id', 'location', 'service_type'):
print(gs)
['center_id', 'location', 'service_type']
['center_id', 'location']
['center_id', 'service_type']
['location', 'service_type']
['center_id']
['location']
['service_type']
[]
Polars - Cube
from polars import col, concat as pl_concat, lit
aggfuncs = [col('cpu_usage').mean(), col('mem_usage').mean()]
frames = []
ldf = pl_df.lazy()
for gs in cube('center_id', 'location', 'service_type'):
if not gs:
query = ldf.select(aggfuncs)
else:
query = ldf.group_by(gs).agg(aggfuncs)
frames.append(
query.with_columns(groupings=lit(gs))
)
pl_concat(frames, how='diagonal_relaxed').collect()
center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|
str | str | str | f64 | f64 | list[str] |
"Center_0" | "West" | "Data Processing" | 54.954341 | 21.972752 | ["center_id", "location", "service_type"] |
"Center_0" | "West" | "Cloud Storage" | 59.67797 | 39.92657 | ["center_id", "location", "service_type"] |
"Center_0" | "East" | "Web Hosting" | 49.510622 | 31.492153 | ["center_id", "location", "service_type"] |
"Center_1" | "East" | "Data Processing" | 52.559742 | 35.97784 | ["center_id", "location", "service_type"] |
"Center_1" | "West" | "Data Processing" | 40.087963 | 35.195383 | ["center_id", "location", "service_type"] |
… | … | … | … | … | … |
null | "East" | null | 52.259052 | 30.260479 | ["location"] |
null | null | "Cloud Storage" | 58.605781 | 32.230542 | ["service_type"] |
null | null | "Web Hosting" | 54.883262 | 33.790533 | ["service_type"] |
null | null | "Data Processing" | 47.754155 | 29.978704 | ["service_type"] |
null | null | null | 54.155928 | 32.123506 | [] |
pandas - Cube
from pandas import concat, Series as pd_Series
results = []
aggfuncs = {'cpu_usage': 'mean', 'mem_usage': 'mean'}
for gs in cube('center_id', 'location', 'service_type'):
if not gs:
res = pd_df.agg(aggfuncs).to_frame().T
else:
res = pd_df.groupby(list(gs), as_index=False).agg(aggfuncs)
results.append(
res.assign(
groupings=lambda d: pd_Series([gs] * len(d), index=d.index)
)
)
concat(results, ignore_index=True).fillna('') # fillna to view output easily
| | center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|---|
0 | Center_0 | East | Cloud Storage | 48.573814 | 34.590125 | [center_id, location, service_type] |
1 | Center_0 | East | Data Processing | 38.898137 | 31.946376 | [center_id, location, service_type] |
2 | Center_0 | East | Web Hosting | 49.510622 | 31.492153 | [center_id, location, service_type] |
3 | Center_0 | West | Cloud Storage | 59.677970 | 39.926570 | [center_id, location, service_type] |
4 | Center_0 | West | Data Processing | 54.954341 | 21.972752 | [center_id, location, service_type] |
5 | Center_0 | West | Web Hosting | 65.743955 | 40.348164 | [center_id, location, service_type] |
6 | Center_1 | East | Cloud Storage | 58.068076 | 26.131858 | [center_id, location, service_type] |
7 | Center_1 | East | Data Processing | 52.559742 | 35.977840 | [center_id, location, service_type] |
8 | Center_1 | East | Web Hosting | 53.608460 | 27.885183 | [center_id, location, service_type] |
9 | Center_1 | West | Cloud Storage | 67.406100 | 32.278916 | [center_id, location, service_type] |
10 | Center_1 | West | Data Processing | 40.087963 | 35.195383 | [center_id, location, service_type] |
11 | Center_1 | West | Web Hosting | 49.162421 | 35.982805 | [center_id, location, service_type] |
12 | Center_0 | East |  | 47.196980 | 33.030179 | [center_id, location]
13 | Center_0 | West |  | 59.775523 | 32.752583 | [center_id, location]
14 | Center_1 | East |  | 55.332453 | 28.578875 | [center_id, location]
15 | Center_1 | West |  | 51.785583 | 34.511047 | [center_id, location]
16 | Center_0 |  | Cloud Storage | 53.755754 | 37.080466 | [center_id, service_type]
17 | Center_0 |  | Data Processing | 51.513726 | 24.109957 | [center_id, service_type]
18 | Center_0 |  | Web Hosting | 59.250622 | 36.805760 | [center_id, service_type]
19 | Center_1 |  | Cloud Storage | 62.070086 | 28.766311 | [center_id, service_type]
20 | Center_1 |  | Data Processing | 44.245223 | 35.456202 | [center_id, service_type]
21 | Center_1 |  | Web Hosting | 51.607742 | 31.529113 | [center_id, service_type]
22 |  | East | Cloud Storage | 54.270371 | 29.515165 | [location, service_type]
23 |  | East | Data Processing | 47.436640 | 34.466041 | [location, service_type]
24 |  | East | Web Hosting | 52.162164 | 29.158231 | [location, service_type]
25 |  | West | Cloud Storage | 64.025043 | 35.624765 | [location, service_type]
26 |  | West | Data Processing | 47.875113 | 28.269243 | [location, service_type]
27 |  | West | Web Hosting | 57.453188 | 38.165485 | [location, service_type]
28 | Center_0 |  |  | 54.915632 | 32.859836 | [center_id]
29 | Center_1 |  |  | 53.559018 | 31.544961 | [center_id]
30 |  | East |  | 52.259052 | 30.260479 | [location]
31 |  | West |  | 55.707917 | 33.647801 | [location]
32 |  |  | Cloud Storage | 58.605781 | 32.230542 | [service_type]
33 |  |  | Data Processing | 47.754155 | 29.978704 | [service_type]
34 |  |  | Web Hosting | 54.883262 | 33.790533 | [service_type]
35 |  |  |  | 54.155928 | 32.123506 | []
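Before we wrap up, it's worth noting that all three patterns share the same shape: generate the groupings, aggregate once per grouping, then concatenate. If you find yourself doing this often, the Polars version folds neatly into a small helper. This is just a sketch built from the code above; grouped_agg is a name I'm introducing here, not a Polars API:
from polars import concat as pl_concat, lit
def grouped_agg(ldf, groupings, aggfuncs):
    "aggregate ldf once per grouping set and stack the results"
    frames = []
    for gs in groupings:
        gs = list(gs)
        query = ldf.select(aggfuncs) if not gs else ldf.group_by(gs).agg(aggfuncs)
        frames.append(query.with_columns(groupings=lit(gs)))
    return pl_concat(frames, how='diagonal_relaxed')

# e.g. grouped_agg(pl_df.lazy(), cube('center_id', 'location', 'service_type'), aggfuncs).collect()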
Wrap-Up
There you have it: replicating DuckDB's Group By "Grouping Sets", "Rollup", and "Cube" functionality! While both pandas and Polars require a bit more manual work, it's entirely feasible with the right approach. A bit of pure Python understanding takes us a long way, even for analytical tasks where we lean heavily on third-party packages.
What did you think about these approaches? Let me know on the DUTC Discord. Talk to you all next week!