Flexibility & Ergonomics

Hi all, welcome back to Cameron’s Corner! This week, I want to talk about flexibility and ergonomics.

Oftentimes, we want to write code that is flexible enough to adapt to the ever-changing problems we are presented with. This often means writing code that anticipates different formulations of an existing business problem. On the other hand, we should also endeavor to write code that is readily usable by our colleagues and other end-users. While these two forces, flexibility and ergonomics, may feel like they pull in opposite directions, we should always strive for a solution where they work in tandem. The most general approach we can take is to design APIs with two primary layers of abstraction (sketched in code after the list below):

  1. Flexibility - functions & data structuring that accomplish problem requirements.

    • Generalized helper functions that make minimal assumptions about their input

    • namedtuples/dataclasses for data structuring; avoid full classes at this point

    • Will require scripting to stitch together the pieces

  2. Ergonomics - abstractions that ease the scripting composition of the above.

    • Full classes that let users interact through familiar Python vocabulary

    • Should be composed (almost entirely) of the utility layer

    • Typically bundles useful utilities together in pipelines

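To make this concrete, here is a minimal, hypothetical sketch of the two layers; the names (Record, summarize, Report) are invented for illustration and are not part of the project that follows.

from dataclasses import dataclass
from statistics import mean

# flexibility layer: a plain data structure and a helper that makes
# minimal assumptions about its input
@dataclass
class Record:
    name: str
    values: list[float]

def summarize(values):
    # any sized collection of numbers works here
    return {'mean': mean(values), 'n': len(values)}

# ergonomics layer: a class composed (almost entirely) of the layer above,
# bundling the utilities into a convenient interface
class Report:
    def __init__(self, records):
        self.records = records

    def summaries(self):
        return {r.name: summarize(r.values) for r in self.records}

print(Report([Record('a', [1.0, 2.0, 3.0])]).summaries())
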
Admittedly, this dichotomy is not perfect, but it does provide guidance for starting a new project and transitioning it from a script into reusable components. I often take issue with “tutorial-driven development” as there ends up being a much larger focus on what the end result should look like rather than the inherent structure of the data/problem itself. This is a great way to code yourself into a corner and create some useless demoware.

Instead, when creating an API, I start with flexibility. That is, I start by writing functions that form reusable, composable solutions to the problem at hand. When writing these functions, I keep two principles in mind: KISS (keep it simple, stupid) and function purity (each function should accomplish one task, without side effects). Of course, these are just principles, so do take them with a grain of salt. Once I have a set of utilities that help me accomplish my goal, I reuse and refine them each time I write a script that needs them.

After spending time creating flexible utilities, you will have a much better understanding of the problem and the actors therein. This knowledge will guide your development of ergonomics: the abstractions that simplify writing scripts against the flexible layer.

Group By Sets in Polars

I want to demonstrate how I apply the above concepts to some code I shared in a previous post, 2024-06-12_groupby-sets. While this is a fairly small problem, I made an intentional decision about how to separate its concerns and design utilities to help solve it.

The problem was to implement SQL’s grouping sets and rollup/cube features in Polars.

Let’s make a data set to work with below:

from pandas import DataFrame, Timestamp, to_timedelta
from numpy.random import default_rng
from polars import from_pandas, Config

Config(float_precision=3)
rng = default_rng(0)

center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]

pd_df = DataFrame({
    # (size := 100) sets the row count once and reuses it via the walrus operator
    "center_id":    rng.choice(center_ids, size=(size := 100)),
    "location":     rng.choice(locations, size=size),
    "service_type": rng.choice(service_types, size=size),
    "timestamp":    (
        Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='min')
    ).floor('min'),  # random timestamps within the past ~2.5 days, floored to the minute
    "cpu_usage":    rng.uniform(0, 100, size=size),
    "mem_usage":    rng.uniform(0, 64, size=size),
})

ldf = from_pandas(pd_df).lazy()

ldf.collect()
shape: (100, 6)
| center_id (str) | location (str) | service_type (str) | timestamp (datetime[ns]) | cpu_usage (f64) | mem_usage (f64) |
|---|---|---|---|---|---|
| "Center_1" | "East" | "Web Hosting" | 2024-09-04 00:12:00 | 31.968 | 57.395 |
| "Center_1" | "West" | "Data Processing" | 2024-09-04 06:29:00 | 18.751 | 37.332 |
| "Center_1" | "East" | "Web Hosting" | 2024-09-04 00:52:00 | 67.253 | 2.574 |
| "Center_0" | "East" | "Web Hosting" | 2024-09-03 08:53:00 | 19.511 | 45.535 |
| "Center_0" | "West" | "Cloud Storage" | 2024-09-04 00:25:00 | 57.769 | 36.418 |
| … | … | … | … | … | … |
| "Center_1" | "West" | "Cloud Storage" | 2024-09-02 08:05:00 | 58.630 | 11.460 |
| "Center_0" | "East" | "Cloud Storage" | 2024-09-02 09:59:00 | 12.269 | 47.886 |
| "Center_0" | "West" | "Data Processing" | 2024-09-01 19:35:00 | 93.377 | 5.548 |
| "Center_0" | "East" | "Cloud Storage" | 2024-09-02 22:54:00 | 68.405 | 27.255 |
| "Center_1" | "West" | "Cloud Storage" | 2024-09-02 19:12:00 | 82.378 | 25.392 |

Flexibility

Let’s think about the problem space: we need to recreate grouping sets, along with rollup and cube. Examining some valid SQL syntax, let’s note the features we want to capture:

Arbitrary Sets

import duckdb

duckdb.query('''
    select center_id, location, mean(columns('.*_usage'))
    from ldf
    group by grouping sets (
        (center_id, location),
        (center_id),
        (),
    )
    order by center_id, location
''').pl()
shape: (7, 4)
| center_id (str) | location (str) | mean(ldf.cpu_usage) (f64) | mean(ldf.mem_usage) (f64) |
|---|---|---|---|
| "Center_0" | "East" | 47.197 | 33.030 |
| "Center_0" | "West" | 59.776 | 32.753 |
| "Center_0" | null | 54.916 | 32.860 |
| "Center_1" | "East" | 55.332 | 28.579 |
| "Center_1" | "West" | 51.786 | 34.511 |
| "Center_1" | null | 53.559 | 31.545 |
| null | null | 54.156 | 32.124 |

Combinatoric Sets (rollup & cube)

duckdb.query('''
    select center_id, location, mean(columns('.*_usage'))
    from ldf
    group by rollup(center_id, location)
    order by center_id, location
''').pl()
shape: (7, 4)
| center_id (str) | location (str) | mean(ldf.cpu_usage) (f64) | mean(ldf.mem_usage) (f64) |
|---|---|---|---|
| "Center_0" | "East" | 47.197 | 33.030 |
| "Center_0" | "West" | 59.776 | 32.753 |
| "Center_0" | null | 54.916 | 32.860 |
| "Center_1" | "East" | 55.332 | 28.579 |
| "Center_1" | "West" | 51.786 | 34.511 |
| "Center_1" | null | 53.559 | 31.545 |
| null | null | 54.156 | 32.124 |

Partial Sets

duckdb.query('''
    select service_type, center_id, location, mean(columns('.*_usage'))
    from ldf
    group by service_type, rollup(center_id, location)
    order by center_id, location
''').pl()
shape: (21, 5)
| service_type (str) | center_id (str) | location (str) | mean(ldf.cpu_usage) (f64) | mean(ldf.mem_usage) (f64) |
|---|---|---|---|---|
| "Cloud Storage" | "Center_0" | "East" | 48.574 | 34.590 |
| "Data Processing" | "Center_0" | "East" | 38.898 | 31.946 |
| "Web Hosting" | "Center_0" | "East" | 49.511 | 31.492 |
| "Cloud Storage" | "Center_0" | "West" | 59.678 | 39.927 |
| "Data Processing" | "Center_0" | "West" | 54.954 | 21.973 |
| … | … | … | … | … |
| "Web Hosting" | "Center_1" | null | 51.608 | 31.529 |
| "Cloud Storage" | "Center_1" | null | 62.070 | 28.766 |
| "Web Hosting" | null | null | 54.883 | 33.791 |
| "Cloud Storage" | null | null | 58.606 | 32.231 |
| "Data Processing" | null | null | 47.754 | 29.979 |

So far, we have seen three uses for grouping:

  1. arbitrary sets

  2. combinatoric sets (rollup & cube)

  3. partial combinatoric sets

Looking at these applications, I see a hierarchy of needs, where if I can define a function that works with any arbitrary sets, then I should be able to write other functions that can generate those sets and pass them into the primary entry point.

Grouping Sets

Since we’ve identified an important entry point, let’s go ahead and define it in our first pass:

from polars import lit, concat, struct
from polars import selectors as cs

def grouping_sets(df, groupings, exprs):
    frames = []
    for i, gs in enumerate(groupings):
        if not gs:
            # the empty grouping set () is a global aggregation
            query = df.select(exprs)
        else:
            query = df.group_by(gs).agg(exprs)
        frames.append(
            # tag each result with the grouping set that produced it
            query.with_columns(groupings=struct(id=lit(i), members=lit(gs)))
        )
    # diagonal concat aligns the differing column sets, filling gaps with nulls
    return concat(frames, how='diagonal')

grouping_sets(
    ldf,
    groupings=[['center_id', 'location'], ['service_type']],
    exprs=cs.ends_with('usage').mean(),
).collect()
shape: (7, 6)
| center_id (str) | location (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) | service_type (str) |
|---|---|---|---|---|---|
| "Center_1" | "East" | 55.332 | 28.579 | {0,["center_id", "location"]} | null |
| "Center_1" | "West" | 51.786 | 34.511 | {0,["center_id", "location"]} | null |
| "Center_0" | "East" | 47.197 | 33.030 | {0,["center_id", "location"]} | null |
| "Center_0" | "West" | 59.776 | 32.753 | {0,["center_id", "location"]} | null |
| null | null | 58.606 | 32.231 | {1,["service_type"]} | "Cloud Storage" |
| null | null | 54.883 | 33.791 | {1,["service_type"]} | "Web Hosting" |
| null | null | 47.754 | 29.979 | {1,["service_type"]} | "Data Processing" |

This function will do a lot of the heavy lifting, so I suspect it will be called in different ways. Notice that it works with either a polars.DataFrame or a polars.LazyFrame. Additionally, the only requirement for groupings is that it is iterable, and exprs just needs to be a valid Polars expression (polars.Expr). By intentionally relaxing our constraints at this phase, we maximize the flexibility of the arguments.

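As a quick sketch of that flexibility (not from the original post; output omitted), the same function accepts an eager DataFrame and a generator of groupings without modification:

eager_df = ldf.collect()  # an eager polars.DataFrame this time

grouping_sets(
    eager_df,                                                 # DataFrame, not LazyFrame
    groupings=(gs for gs in [['center_id'], ['location']]),  # any iterable works
    exprs=cs.ends_with('usage').mean(),
)
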
Combinatorics

Now let’s shift our attention to generating the groupings that we will feed into the function above. This pattern has the convenience of decoupling our combinatoric iterators from the grouping_sets logic, which keeps the functions pure and simplifies testing them as well.

from itertools import islice, combinations

def cube(*items):
    """powerset of items from largest to smallest
    
    >>> list(cube('a', 'b'))
    [['a', 'b'], ['a'], ['b'], []]
    """
    # the largest subset has size len(items); count down to the empty set
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]

def rollup(*items):
    """produce shrinking subsets of items
    
    >>> list(rollup('a', 'b', 'c', 'd'))
    [['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['c', 'd'], ['d'], []]
    """
    return (
        [*islice(items, i, None)] for i in range(len(items)+1)
    )

grouping_sets(
    ldf,
    groupings=rollup('center_id', 'service_type'),
    exprs=cs.ends_with('usage').mean(),
).collect()
shape: (10, 5)
| center_id (str) | service_type (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) |
|---|---|---|---|---|
| "Center_0" | "Data Processing" | 51.514 | 24.110 | {0,["center_id", "service_type"]} |
| "Center_1" | "Cloud Storage" | 62.070 | 28.766 | {0,["center_id", "service_type"]} |
| "Center_1" | "Web Hosting" | 51.608 | 31.529 | {0,["center_id", "service_type"]} |
| "Center_1" | "Data Processing" | 44.245 | 35.456 | {0,["center_id", "service_type"]} |
| "Center_0" | "Cloud Storage" | 53.756 | 37.080 | {0,["center_id", "service_type"]} |
| "Center_0" | "Web Hosting" | 59.251 | 36.806 | {0,["center_id", "service_type"]} |
| null | "Data Processing" | 47.754 | 29.979 | {1,["service_type"]} |
| null | "Cloud Storage" | 58.606 | 32.231 | {1,["service_type"]} |
| null | "Web Hosting" | 54.883 | 33.791 | {1,["service_type"]} |
| null | null | 54.156 | 32.124 | {2,[]} |

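For completeness, cube slots into the same entry point just as easily (a quick sketch; output omitted here):

grouping_sets(
    ldf,
    groupings=cube('center_id', 'location'),
    exprs=cs.ends_with('usage').mean(),
).collect()
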
Partial Sets

This case is a bit different from the previous ones because it does not cleanly fit into the mechanisms above. This leaves us with two options:

  1. Implement the partial logic into grouping_sets

    • raises the question: how do I toggle it on/off?

  2. Create a separate partial_grouping_sets function specifically for this case

The first option introduces a modality: do we want to pass our input through the partial logic? Can we toggle it? Do we just create a near-parallel/different entry point?

In my opinion, the second option is going to be easier to work with and test; however, we also have a third option: another layer of convenience. If we view partially creating sets as composition alongside the existing rollup/cube implementations, we can write another completely decoupled function to handle this layer of the code.

from itertools import repeat, chain
from collections.abc import Iterator

def partial_sets(*items):
    """combine grouped values alongside rollup, cube, and other iterators
    
    >>> list(partial_sets('center_id', rollup('location', 'service_type')))
    [['center_id', 'location', 'service_type'],
     ['center_id', 'service_type'],
     ['center_id']]
    """
    # non-iterator items are fixed columns: repeat them so they appear in
    # every generated set; zip stops once the shortest iterator is exhausted
    groupings = (
        gs if isinstance(gs, Iterator) else repeat([gs])
        for gs in items
    )
    return (
        [*chain.from_iterable(parts)] for parts in zip(*groupings)
    )

[*partial_sets('center_id', rollup('location', 'service_type'))]
[['center_id', 'location', 'service_type'],
 ['center_id', 'service_type'],
 ['center_id']]

Again, by decoupling the set-generation logic from its evaluation against the polars.LazyFrame, we are making code that is easier to test and maintain.

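Because these generators are pure, we can sanity-check them with plain asserts and no Polars at all; a few illustrative tests (not from the original post):

assert [*cube('a', 'b')] == [['a', 'b'], ['a'], ['b'], []]
assert [*rollup('a', 'b')] == [['a', 'b'], ['b'], []]
assert [*partial_sets('x', rollup('a', 'b'))] == [['x', 'a', 'b'], ['x', 'b'], ['x']]
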
Finally, let’s see this approach in action:

grouping_sets(
    ldf,
    groupings=partial_sets('center_id', rollup('location', 'service_type')),
    exprs=cs.ends_with('usage').mean(),
).collect()
shape: (20, 6)
| center_id (str) | location (str) | service_type (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) |
|---|---|---|---|---|---|
| "Center_0" | "East" | "Data Processing" | 38.898 | 31.946 | {0,["center_id", "location", "service_type"]} |
| "Center_1" | "West" | "Cloud Storage" | 67.406 | 32.279 | {0,["center_id", "location", "service_type"]} |
| "Center_1" | "East" | "Web Hosting" | 53.608 | 27.885 | {0,["center_id", "location", "service_type"]} |
| "Center_0" | "West" | "Data Processing" | 54.954 | 21.973 | {0,["center_id", "location", "service_type"]} |
| "Center_1" | "West" | "Data Processing" | 40.088 | 35.195 | {0,["center_id", "location", "service_type"]} |
| … | … | … | … | … | … |
| "Center_0" | null | "Web Hosting" | 59.251 | 36.806 | {1,["center_id", "service_type"]} |
| "Center_0" | null | "Data Processing" | 51.514 | 24.110 | {1,["center_id", "service_type"]} |
| "Center_0" | null | "Cloud Storage" | 53.756 | 37.080 | {1,["center_id", "service_type"]} |
| "Center_1" | null | null | 53.559 | 31.545 | {2,["center_id"]} |
| "Center_0" | null | null | 54.916 | 32.860 | {2,["center_id"]} |

While this works as a flexible (and easily testable) solution to the given problem, it is not as ergonomic as it could be, especially compared to the method-chaining style of Polars code today. Thankfully, we should be able to take the flexible utilities we created above and massage them into an ergonomic abstraction that our Polars users can access seamlessly.

Ergonomics

Now that we have our flexible layer, let’s continue the discussion and take a look at ergonomics.

To implement the ideas explored above more fluently (which is a silly term, but we’ll go with it), Polars offers a namespace registration mechanism. This mechanism essentially monkey-patches the class, allowing us to dynamically add attributes at runtime.

from polars import LazyFrame, lit, concat, struct
from polars.api import register_lazyframe_namespace
from collections.abc import Iterator

@register_lazyframe_namespace('grouping')
class ExpandedLazyGroupBy:
    def __init__(self, ldf: LazyFrame):
        self._ldf = ldf
        self._contexts = []

    def __call__(self, *groupings, maintain_order=False):
        # if any argument is an iterator (e.g., rollup/cube), coerce the
        # inputs through partial_sets before delegating to .sets
        if any(isinstance(gs, Iterator) for gs in groupings):
            groupings = partial_sets(*groupings)
        return self.sets(*groupings, maintain_order=maintain_order)
    
    def sets(self, *groupings, maintain_order=False):
        contexts = []
        for g in groupings:
            # the empty grouping set is a global aggregation via .select
            ctx = (
                self._ldf.group_by(g, maintain_order=maintain_order).agg
                if g else self._ldf.select
            )
            contexts.append((g, ctx))
        self._contexts = contexts
        return self

    def agg(self, *aggs, **named_aggs):
        frames = []
        for i, (gs, ctx) in enumerate(self._contexts):
            frames.append(
                ctx(*aggs, **named_aggs)
                .with_columns(
                    groupings=struct(id=lit(i), members=lit(gs))
                )
            )
        return concat(frames, how='diagonal_relaxed')

This lets us access our new methods under the LazyFrame.grouping attribute. Each time we access it, we get a fresh instance of ExpandedLazyGroupBy, as the snippet below demonstrates.

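A quick check of that claim (an illustrative snippet, assuming the registration above):

ns_a, ns_b = ldf.grouping, ldf.grouping
assert isinstance(ns_a, ExpandedLazyGroupBy)
assert ns_a is not ns_b  # each attribute access constructs a fresh namespace object
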
This means that we can use this object within typical Polars syntax (method chaining). Additionally, we have rolled the convenience of partial_sets into this interface as the default.

(
    ldf.grouping('center_id', rollup('location'))
    .agg(cs.ends_with('usage').mean())
    .collect()
)
shape: (6, 5)
| center_id (str) | location (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) |
|---|---|---|---|---|
| "Center_0" | "East" | 47.197 | 33.030 | {0,["center_id", "location"]} |
| "Center_0" | "West" | 59.776 | 32.753 | {0,["center_id", "location"]} |
| "Center_1" | "West" | 51.786 | 34.511 | {0,["center_id", "location"]} |
| "Center_1" | "East" | 55.332 | 28.579 | {0,["center_id", "location"]} |
| "Center_0" | null | 54.916 | 32.860 | {1,["center_id"]} |
| "Center_1" | null | 53.559 | 31.545 | {1,["center_id"]} |

However, when including features like our partial_sets convenience, it is often useful to be able to turn them off. I opted for a separate method, .sets, that lets you explicitly pass pre-defined grouping sets. This skips the input-coercion step that partial_sets adds on.

(
    ldf.grouping.sets('center_id', 'location')
    .agg(cs.ends_with('usage').mean())
    .collect()
)
shape: (4, 5)
| center_id (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) | location (str) |
|---|---|---|---|---|
| "Center_0" | 54.916 | 32.860 | {0,"center_id"} | null |
| "Center_1" | 53.559 | 31.545 | {0,"center_id"} | null |
| null | 52.259 | 30.260 | {1,"location"} | "East" |
| null | 55.708 | 33.648 | {1,"location"} | "West" |

Wrap-Up

That’s all we have time for this week! Stay tuned next time for some more fun with Polars.

What do you think about flexibility and ergonomics? Let me know on the DUTC Discord. Talk to you all next week!