Flexibility & Ergonomics#
Hi all, welcome back to Cameron’s Corner! This week, I want to talk about flexibility and ergonomics.
Oftentimes, we want to write code that is flexible to adapt to the ever-changing problems we are presented with. This often means that we have to write code that anticipates different formulations of an existing business problem. On the other hand, we should also endeavor to write code that is readily usable by our colleagues or other end-users. While these forces—flexibility and ergonomics—may feel like they pull in opposite directions, we should always strive to find a solution where these ideas work in tandem. The most generalized approach we can take to satisfy this is to design APIs with two primary layers of abstraction:
- Flexibility - functions & data structuring that accomplish the problem requirements.
  - Generalized helper functions that make minimal assumptions about their input
  - namedtuple/dataclasses for data structuring; avoid full classes at this point
  - Will require scripting to stitch the pieces together
- Ergonomics - abstractions that ease the scripting composition of the above.
  - Full classes that interact with the Python vocabulary
  - Should be composed (almost entirely) of the utility layer
  - Typically bundles useful utilities together into pipelines
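To make the split concrete, here is a deliberately tiny sketch of the two layers; the Reading, to_fahrenheit, and ReadingReport names are hypothetical and only here for illustration:

```python
from typing import NamedTuple

# Flexibility layer: a light data structure plus a pure helper that
# makes minimal assumptions about its input.
class Reading(NamedTuple):
    sensor: str
    celsius: float

def to_fahrenheit(celsius):
    """Pure, single-purpose helper; works on any number-like input."""
    return celsius * 9 / 5 + 32

# Ergonomics layer: a small class composed (almost entirely) of the helpers above.
class ReadingReport:
    def __init__(self, readings):
        self.readings = [*readings]

    def fahrenheit(self):
        return [to_fahrenheit(r.celsius) for r in self.readings]

report = ReadingReport([Reading('a', 20.0), Reading('b', 30.0)])
print(report.fahrenheit())   # [68.0, 86.0]
```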
Admittedly, this dichotomy is not perfect, but it does provide guidance to start a new project and transition it from a script into reusable components. I often take issue with “tutorial-driven development” as there ends up being a much larger focus on what the end result should look like rather than the inherent structuring in the data/problem itself. This is a great way to code yourself into a corner and create some useless demoware.
Instead, when creating an API, I start with flexibility. That is, I start by writing functions for reusable and composable solutions for the problem at hand. When creating these functions, I keep in mind two principles: KISS (keep it simple, stupid) and function purity (functions should accomplish one task, not many). Of course, these principles at their core are just principles, so do take them with a grain of salt. Once I have a set of utilities that help me accomplish my goal, I tend to reuse and refine them each time I need to write a script that uses them.
After spending time creating flexible utilities, you will have a much better understanding of the problem and the actors therein. This is the knowledge that will guide your development of ergonomics, the abstractions to simplify writing scripts involving the flexible layer.
Group By Sets in Polars#
I want to demonstrate how I apply the above concepts to some code I shared in a previous post, 2024-06-12_groupby-sets. While this is a fairly small problem, I made an intentional decision about how to separate its concerns and design utilities to help solve it.
This problem was an implementation of SQL’s group by sets & group by cube/rollup features in Polars.
Let’s make a data set to work with below:
from pandas import DataFrame, Timestamp, to_timedelta
from numpy.random import default_rng
from polars import from_pandas, Config
Config(float_precision=3)
rng = default_rng(0)
center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]
pd_df = DataFrame({
    "center_id": rng.choice(center_ids, size=(size := 100)),
    "location": rng.choice(locations, size=size),
    "service_type": rng.choice(service_types, size=size),
    "timestamp": (
        Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='min')
    ).floor('min'),
    "cpu_usage": rng.uniform(0, 100, size=size),
    "mem_usage": rng.uniform(0, 64, size=size),
})
ldf = from_pandas(pd_df).lazy()
ldf.collect()
center_id | location | service_type | timestamp | cpu_usage | mem_usage |
---|---|---|---|---|---|
str | str | str | datetime[ns] | f64 | f64 |
"Center_1" | "East" | "Web Hosting" | 2024-09-04 00:12:00 | 31.968 | 57.395 |
"Center_1" | "West" | "Data Processing" | 2024-09-04 06:29:00 | 18.751 | 37.332 |
"Center_1" | "East" | "Web Hosting" | 2024-09-04 00:52:00 | 67.253 | 2.574 |
"Center_0" | "East" | "Web Hosting" | 2024-09-03 08:53:00 | 19.511 | 45.535 |
"Center_0" | "West" | "Cloud Storage" | 2024-09-04 00:25:00 | 57.769 | 36.418 |
… | … | … | … | … | … |
"Center_1" | "West" | "Cloud Storage" | 2024-09-02 08:05:00 | 58.630 | 11.460 |
"Center_0" | "East" | "Cloud Storage" | 2024-09-02 09:59:00 | 12.269 | 47.886 |
"Center_0" | "West" | "Data Processing" | 2024-09-01 19:35:00 | 93.377 | 5.548 |
"Center_0" | "East" | "Cloud Storage" | 2024-09-02 22:54:00 | 68.405 | 27.255 |
"Center_1" | "West" | "Cloud Storage" | 2024-09-02 19:12:00 | 82.378 | 25.392 |
Flexibility#
Let’s think about the problem space: we need to recreate group by sets along with rollup and cube. Examining some valid SQL syntax, let’s note the features we want to capture:
**Arbitrary Sets**
import duckdb
duckdb.query('''
select center_id, location, mean(columns('.*_usage'))
from ldf
group by grouping sets (
(center_id, location),
(center_id),
(),
)
order by center_id, location
''').pl()
center_id | location | mean(ldf.cpu_usage) | mean(ldf.mem_usage) |
---|---|---|---|
str | str | f64 | f64 |
"Center_0" | "East" | 47.197 | 33.030 |
"Center_0" | "West" | 59.776 | 32.753 |
"Center_0" | null | 54.916 | 32.860 |
"Center_1" | "East" | 55.332 | 28.579 |
"Center_1" | "West" | 51.786 | 34.511 |
"Center_1" | null | 53.559 | 31.545 |
null | null | 54.156 | 32.124 |
**Combinatoric Sets (rollup & cube)**
duckdb.query('''
select center_id, location, mean(columns('.*_usage'))
from ldf
group by rollup(center_id, location)
order by center_id, location
''').pl()
center_id | location | mean(ldf.cpu_usage) | mean(ldf.mem_usage) |
---|---|---|---|
str | str | f64 | f64 |
"Center_0" | "East" | 47.197 | 33.030 |
"Center_0" | "West" | 59.776 | 32.753 |
"Center_0" | null | 54.916 | 32.860 |
"Center_1" | "East" | 55.332 | 28.579 |
"Center_1" | "West" | 51.786 | 34.511 |
"Center_1" | null | 53.559 | 31.545 |
null | null | 54.156 | 32.124 |
**Partial Sets**
duckdb.query('''
select service_type, center_id, location, mean(columns('.*_usage'))
from ldf
group by service_type, rollup(center_id, location)
order by center_id, location
''').pl()
service_type | center_id | location | mean(ldf.cpu_usage) | mean(ldf.mem_usage) |
---|---|---|---|---|
str | str | str | f64 | f64 |
"Cloud Storage" | "Center_0" | "East" | 48.574 | 34.590 |
"Data Processing" | "Center_0" | "East" | 38.898 | 31.946 |
"Web Hosting" | "Center_0" | "East" | 49.511 | 31.492 |
"Cloud Storage" | "Center_0" | "West" | 59.678 | 39.927 |
"Data Processing" | "Center_0" | "West" | 54.954 | 21.973 |
… | … | … | … | … |
"Web Hosting" | "Center_1" | null | 51.608 | 31.529 |
"Cloud Storage" | "Center_1" | null | 62.070 | 28.766 |
"Web Hosting" | null | null | 54.883 | 33.791 |
"Cloud Storage" | null | null | 58.606 | 32.231 |
"Data Processing" | null | null | 47.754 | 29.979 |
So far, we have seen three uses for grouping:
1. arbitrary sets
2. combinatoric sets (rollup & cube)
3. partial combinatoric sets
Looking at these applications, I see a hierarchy of needs: if I can define a function that works with arbitrary sets, then I should be able to write other functions that generate those sets and pass them into that primary entry point.
Grouping Sets#
Since we’ve identified an important entry point, let’s go ahead and define it in our first pass:
from polars import lit, concat, struct
from polars import selectors as cs
def grouping_sets(df, groupings, exprs):
    frames = []
    for i, gs in enumerate(groupings):
        if not gs:
            # the empty set () aggregates over the entire frame
            query = df.select(exprs)
        else:
            query = df.group_by(gs).agg(exprs)
        # tag each result with the id and members of the grouping that produced it
        frames.append(
            query.with_columns(groupings=struct(id=lit(i), members=lit(gs)))
        )
    return concat(frames, how='diagonal')
grouping_sets(
    ldf,
    groupings=[['center_id', 'location'], ['service_type']],
    exprs=cs.ends_with('usage').mean(),
).collect()
center_id | location | cpu_usage | mem_usage | groupings | service_type |
---|---|---|---|---|---|
str | str | f64 | f64 | struct[2] | str |
"Center_1" | "East" | 55.332 | 28.579 | {0,["center_id", "location"]} | null |
"Center_1" | "West" | 51.786 | 34.511 | {0,["center_id", "location"]} | null |
"Center_0" | "East" | 47.197 | 33.030 | {0,["center_id", "location"]} | null |
"Center_0" | "West" | 59.776 | 32.753 | {0,["center_id", "location"]} | null |
null | null | 58.606 | 32.231 | {1,["service_type"]} | "Cloud Storage" |
null | null | 54.883 | 33.791 | {1,["service_type"]} | "Web Hosting" |
null | null | 47.754 | 29.979 | {1,["service_type"]} | "Data Processing" |
This function will do a lot of heavy lifting, so I suspect it will be called in different manners. Notice that it works with either a polars.DataFrame or a polars.LazyFrame. Additionally, the only requirement for groupings is that it is iterable, and exprs just needs to be a valid Polars expression (polars.Expr). By intentionally relaxing our constraints at this phase, we are maximizing the flexibility of the arguments.
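As a quick sketch of those relaxed constraints (reusing the objects defined above), the same function should accept an eager DataFrame and a generator of groupings without modification:

```python
# an eager polars.DataFrame and a lazily generated `groupings` iterable
eager_result = grouping_sets(
    ldf.collect(),                                            # DataFrame instead of LazyFrame
    groupings=(gs for gs in [['center_id'], ['location']]),   # any iterable of sets works
    exprs=cs.ends_with('usage').mean(),
)
print(eager_result)   # already a DataFrame; no .collect() needed
```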
Combinatorics#
Now let’s shift our attention to populating the groups that we will feed into our above function. This usage pattern has the convenience of decoupling our combinatoric iterators from the grouping_sets logic, which keeps the functions pure and simplifies testing them as well.
from itertools import islice, combinations

def cube(*items):
    """powerset of items from largest to smallest

    >>> list(cube('a', 'b'))
    [['a', 'b'], ['a'], ['b'], []]
    """
    for size in range(len(items)+1, -1, -1):
        for combo in combinations(items, size):
            yield [*combo]

def rollup(*items):
    """produce shrinking subsets of items

    >>> list(rollup('a', 'b', 'c', 'd'))
    [['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['c', 'd'], ['d'], []]
    """
    return (
        [*islice(items, i, None)] for i in range(len(items)+1)
    )
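Because these generators are pure and know nothing about Polars, they are easy to verify in isolation; a minimal check, reusing the doctest examples above, might look like this:

```python
# quick sanity checks for the pure set generators defined above
assert list(cube('a', 'b')) == [['a', 'b'], ['a'], ['b'], []]
assert list(rollup('a', 'b')) == [['a', 'b'], ['b'], []]
```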
grouping_sets(
    ldf,
    groupings=rollup('center_id', 'service_type'),
    exprs=cs.ends_with('usage').mean(),
).collect()
center_id | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|
str | str | f64 | f64 | struct[2] |
"Center_0" | "Data Processing" | 51.514 | 24.110 | {0,["center_id", "service_type"]} |
"Center_1" | "Cloud Storage" | 62.070 | 28.766 | {0,["center_id", "service_type"]} |
"Center_1" | "Web Hosting" | 51.608 | 31.529 | {0,["center_id", "service_type"]} |
"Center_1" | "Data Processing" | 44.245 | 35.456 | {0,["center_id", "service_type"]} |
"Center_0" | "Cloud Storage" | 53.756 | 37.080 | {0,["center_id", "service_type"]} |
"Center_0" | "Web Hosting" | 59.251 | 36.806 | {0,["center_id", "service_type"]} |
null | "Data Processing" | 47.754 | 29.979 | {1,["service_type"]} |
null | "Cloud Storage" | 58.606 | 32.231 | {1,["service_type"]} |
null | "Web Hosting" | 54.883 | 33.791 | {1,["service_type"]} |
null | null | 54.156 | 32.124 | {2,[]} |
Partial Sets#
This case is a bit different from the previous ones because it does not cleanly fit into our above mechanisms. This leaves us with two options:
1. Implement the partial logic inside grouping_sets, which raises the question: how do I toggle it on/off?
2. Create a separate partial_grouping_sets function specifically for this case.
The first option concerns itself with a modality: do we want to pass our input through the partial logic? Can we toggle it? Do we just create a near-parallel/different entry point?
In my opinion, the second option is going to be easier to work with and test; however, we also have a third option: another layer of convenience. If we consider partially creating sets, we can just view this problem as a composition alongside the existing rollup/cube implementations. This means that we can write another completely decoupled function to handle this layer of the code.
from itertools import repeat, chain
from typing import Iterator

def partial_sets(*items):
    """combine grouped values alongside rollup, cube, and other iterators

    >>> list(partial_sets('center_id', rollup('location', 'service_type')))
    [['center_id', 'location', 'service_type'],
     ['center_id', 'service_type'],
     ['center_id']]
    """
    # scalars repeat forever; actual iterators (rollup/cube) drive the zip,
    # which stops at the shortest input
    groupings = (
        gs if isinstance(gs, Iterator) else repeat([gs])
        for gs in items
    )
    return (
        [*chain.from_iterable(parts)] for parts in zip(*groupings)
    )
[*partial_sets('center_id', rollup('location', 'service_type'))]
[['center_id', 'location', 'service_type'],
['center_id', 'service_type'],
['center_id']]
Again, by decoupling the set-generation logic from its evaluation against the polars.LazyFrame, we are writing code that is easier to test and maintain.
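For example, since partial_sets accepts any iterator of sets, it composes just as readily with cube; a small sketch (not one of the original outputs) using the functions defined above:

```python
# partial_sets pairs the fixed 'center_id' column with every set cube() yields
assert [*partial_sets('center_id', cube('location', 'service_type'))] == [
    ['center_id', 'location', 'service_type'],
    ['center_id', 'location'],
    ['center_id', 'service_type'],
    ['center_id'],
]
```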
Finally, let’s see how this approach performs in action:
grouping_sets(
    ldf,
    groupings=partial_sets('center_id', rollup('location', 'service_type')),
    exprs=cs.ends_with('usage').mean(),
).collect()
center_id | location | service_type | cpu_usage | mem_usage | groupings |
---|---|---|---|---|---|
str | str | str | f64 | f64 | struct[2] |
"Center_0" | "East" | "Data Processing" | 38.898 | 31.946 | {0,["center_id", "location", "service_type"]} |
"Center_1" | "West" | "Cloud Storage" | 67.406 | 32.279 | {0,["center_id", "location", "service_type"]} |
"Center_1" | "East" | "Web Hosting" | 53.608 | 27.885 | {0,["center_id", "location", "service_type"]} |
"Center_0" | "West" | "Data Processing" | 54.954 | 21.973 | {0,["center_id", "location", "service_type"]} |
"Center_1" | "West" | "Data Processing" | 40.088 | 35.195 | {0,["center_id", "location", "service_type"]} |
… | … | … | … | … | … |
"Center_0" | null | "Web Hosting" | 59.251 | 36.806 | {1,["center_id", "service_type"]} |
"Center_0" | null | "Data Processing" | 51.514 | 24.110 | {1,["center_id", "service_type"]} |
"Center_0" | null | "Cloud Storage" | 53.756 | 37.080 | {1,["center_id", "service_type"]} |
"Center_1" | null | null | 53.559 | 31.545 | {2,["center_id"]} |
"Center_0" | null | null | 54.916 | 32.860 | {2,["center_id"]} |
While this works as a flexible (and easily testable) solution to the given problem, it is not as ergonomic as it could be, especially compared to the method-chaining style of idiomatic Polars code today. Thankfully, we should be able to take the flexible utilities we created above and massage them into an ergonomic abstraction that our Polars users can access seamlessly.
Ergonomics#
Now that we have the flexible layer in place, let’s continue the discussion and turn our attention to the ergonomics side.
To implement the ideas explored above more fluently (which is a silly term, but we’ll go with it), Polars offers a namespace registration mechanism. This mechanism essentially monkey-patches the class, allowing us to dynamically add attributes at runtime.
from polars import LazyFrame, concat, struct
from polars.api import register_lazyframe_namespace

@register_lazyframe_namespace('grouping')
class ExpandedLazyGroupBy:
    def __init__(self, ldf: LazyFrame):
        self._ldf = ldf
        self._contexts = []

    def __call__(self, *groupings, maintain_order=False):
        # coerce mixed scalar/iterator inputs through partial_sets by default
        if any(isinstance(gs, Iterator) for gs in groupings):
            groupings = partial_sets(*groupings)
        return self.sets(*groupings, maintain_order=maintain_order)

    def sets(self, *groupings, maintain_order=False):
        contexts = []
        for g in groupings:
            # an empty grouping aggregates over the whole frame via .select
            ctx = (
                self._ldf.group_by(g, maintain_order=maintain_order).agg
                if g else self._ldf.select
            )
            contexts.append((g, ctx))
        self._contexts = contexts
        return self

    def agg(self, *aggs, **named_aggs):
        frames = []
        for i, (gs, ctx) in enumerate(self._contexts):
            frames.append(
                ctx(*aggs, **named_aggs)
                .with_columns(
                    groupings=struct(id=lit(i), members=lit(gs))
                )
            )
        return concat(frames, how='diagonal_relaxed')
This lets us access our new methods under the LazyFrame.grouping attribute. Each time we access it, we get a fresh instance of ExpandedLazyGroupBy, which means we can use this object within typical Polars syntax (method chaining). Additionally, we have rolled the partial_sets convenience into this interface as the default.
(
    ldf.grouping('center_id', rollup('location'))
    .agg(cs.ends_with('usage').mean())
    .collect()
)
center_id | location | cpu_usage | mem_usage | groupings |
---|---|---|---|---|
str | str | f64 | f64 | struct[2] |
"Center_0" | "East" | 47.197 | 33.030 | {0,["center_id", "location"]} |
"Center_0" | "West" | 59.776 | 32.753 | {0,["center_id", "location"]} |
"Center_1" | "West" | 51.786 | 34.511 | {0,["center_id", "location"]} |
"Center_1" | "East" | 55.332 | 28.579 | {0,["center_id", "location"]} |
"Center_0" | null | 54.916 | 32.860 | {1,["center_id"]} |
"Center_1" | null | 53.559 | 31.545 | {1,["center_id"]} |
However, when including features like our partial_sets convenience, it is often useful to be able to turn them off. I opted for a separate method, .sets, that lets you explicitly pass pre-defined grouping sets; it skips over the input coercion step that partial_sets adds.
(
    ldf.grouping.sets('center_id', 'location')
    .agg(cs.ends_with('usage').mean())
    .collect()
)
center_id | cpu_usage | mem_usage | groupings | location |
---|---|---|---|---|
str | f64 | f64 | struct[2] | str |
"Center_0" | 54.916 | 32.860 | {0,"center_id"} | null |
"Center_1" | 53.559 | 31.545 | {0,"center_id"} | null |
null | 52.259 | 30.260 | {1,"location"} | "East" |
null | 55.708 | 33.648 | {1,"location"} | "West" |
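And since the __call__ entry point coerces any iterator through partial_sets, cube should drop into the same chain as well; a quick sketch (output omitted):

```python
# cube() is an iterator, so it is routed through partial_sets automatically
(
    ldf.grouping('center_id', cube('location', 'service_type'))
    .agg(cs.ends_with('usage').mean())
    .collect()
)
```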
Wrap-Up#
That’s all we have time for this week! Stay tuned next time for some more fun with Polars.
What do you think about flexibility and ergonomics? Let me know on the DUTC Discord. Talk to you all next week!