Flexibility & Ergonomics

Hi all, welcome back to Cameron’s Corner! This week, I want to talk about flexibility and ergonomics.

Oftentimes, we want to write code that is flexible enough to adapt to the ever-changing problems we are presented with. This often means writing code that anticipates different formulations of an existing business problem. On the other hand, we should also endeavor to write code that is readily usable by our colleagues and other end-users. While these two forces, flexibility and ergonomics, may feel like they pull in opposite directions, we should always strive for a solution where they work in tandem. The most general approach we can take is to design APIs with two primary layers of abstraction (sketched in code after the list below):

  1. Flexibility - functions & data structuring that accomplish problem requirements.

    • Generalized helper functions that make minimal assumptions about their input

    • namedtuples/dataclasses for data structuring; avoid full classes at this point

    • Will require scripting to stitch together the pieces

  2. Ergonomics - abstractions that ease the scripting composition of the above.

    • Full classes that let users interact through familiar Python vocabulary

    • Should be composed (almost entirely) of the utility layer

    • Typically bundles useful utilities together in pipelines

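To make this concrete, here is a minimal, hypothetical sketch of the two layers; the names (Record, summarize, Report) are invented for illustration and are not part of the project that follows.

from dataclasses import dataclass
from statistics import mean

# flexibility layer: a plain data structure and a helper that makes
# minimal assumptions about its input
@dataclass
class Record:
    name: str
    values: list[float]

def summarize(values):
    # any sized collection of numbers works here
    return {'mean': mean(values), 'n': len(values)}

# ergonomics layer: a class composed (almost entirely) of the layer above,
# bundling the utilities into a convenient interface
class Report:
    def __init__(self, records):
        self.records = records

    def summaries(self):
        return {r.name: summarize(r.values) for r in self.records}

print(Report([Record('a', [1.0, 2.0, 3.0])]).summaries())
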
Admittedly, this dichotomy is not perfect, but it does provide guidance for starting a new project and transitioning it from a script into reusable components. I often take issue with “tutorial-driven development” as there ends up being a much larger focus on what the end result should look like rather than the inherent structure of the data/problem itself. This is a great way to code yourself into a corner and create some useless demoware.

Instead, when creating an API, I start with flexibility. That is, I start by writing functions that form reusable, composable solutions to the problem at hand. When writing these functions, I keep two principles in mind: KISS (keep it simple, stupid) and function purity (each function should accomplish one task, without side effects). Of course, these are just principles, so do take them with a grain of salt. Once I have a set of utilities that help me accomplish my goal, I reuse and refine them each time I write a script that needs them.

After spending time creating flexible utilities, you will have a much better understanding of the problem and the actors therein. This knowledge will guide your development of ergonomics: the abstractions that simplify writing scripts against the flexible layer.

Group By Sets in Polars

I want to demonstrate how I apply the above concepts to some code I shared in a previous post, 2024-06-12_groupby-sets. While this is a fairly small problem, I made an intentional decision about how to separate its concerns and design utilities to help solve it.

The problem was to implement SQL’s grouping sets and rollup/cube features in Polars.

Let’s make a data set to work with below:

from pandas import DataFrame, Timestamp, to_timedelta
from numpy.random import default_rng
from polars import from_pandas, Config

Config(float_precision=3)
rng = default_rng(0)

center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]

pd_df = DataFrame({
    # (size := 100) sets the row count once and reuses it via the walrus operator
    "center_id":    rng.choice(center_ids, size=(size := 100)),
    "location":     rng.choice(locations, size=size),
    "service_type": rng.choice(service_types, size=size),
    "timestamp":    (
        Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='min')
    ).floor('min'),  # random timestamps within the past ~2.5 days, floored to the minute
    "cpu_usage":    rng.uniform(0, 100, size=size),
    "mem_usage":    rng.uniform(0, 64, size=size),
})

ldf = from_pandas(pd_df).lazy()

ldf.collect()
shape: (100, 6)
| center_id (str) | location (str) | service_type (str) | timestamp (datetime[ns]) | cpu_usage (f64) | mem_usage (f64) |
|---|---|---|---|---|---|
| "Center_1" | "East" | "Web Hosting" | 2024-09-04 00:12:00 | 31.968 | 57.395 |
| "Center_1" | "West" | "Data Processing" | 2024-09-04 06:29:00 | 18.751 | 37.332 |
| "Center_1" | "East" | "Web Hosting" | 2024-09-04 00:52:00 | 67.253 | 2.574 |
| "Center_0" | "East" | "Web Hosting" | 2024-09-03 08:53:00 | 19.511 | 45.535 |
| "Center_0" | "West" | "Cloud Storage" | 2024-09-04 00:25:00 | 57.769 | 36.418 |
| … | … | … | … | … | … |
| "Center_1" | "West" | "Cloud Storage" | 2024-09-02 08:05:00 | 58.630 | 11.460 |
| "Center_0" | "East" | "Cloud Storage" | 2024-09-02 09:59:00 | 12.269 | 47.886 |
| "Center_0" | "West" | "Data Processing" | 2024-09-01 19:35:00 | 93.377 | 5.548 |
| "Center_0" | "East" | "Cloud Storage" | 2024-09-02 22:54:00 | 68.405 | 27.255 |
| "Center_1" | "West" | "Cloud Storage" | 2024-09-02 19:12:00 | 82.378 | 25.392 |

Flexibility

Let’s think about the problem space: we need to recreate grouping sets, along with rollup and cube. Examining some valid SQL syntax, let’s note the features we want to capture:

Arbitrary Sets

import duckdb

duckdb.query('''
    select center_id, location, mean(columns('.*_usage'))
    from ldf
    group by grouping sets (
        (center_id, location),
        (center_id),
        (),
    )
    order by center_id, location
''').pl()
shape: (7, 4)
| center_id (str) | location (str) | mean(ldf.cpu_usage) (f64) | mean(ldf.mem_usage) (f64) |
|---|---|---|---|
| "Center_0" | "East" | 47.197 | 33.030 |
| "Center_0" | "West" | 59.776 | 32.753 |
| "Center_0" | null | 54.916 | 32.860 |
| "Center_1" | "East" | 55.332 | 28.579 |
| "Center_1" | "West" | 51.786 | 34.511 |
| "Center_1" | null | 53.559 | 31.545 |
| null | null | 54.156 | 32.124 |

Combinatoric Sets (rollup & cube)

duckdb.query('''
    select center_id, location, mean(columns('.*_usage'))
    from ldf
    group by rollup(center_id, location)
    order by center_id, location
''').pl()
shape: (7, 4)
| center_id (str) | location (str) | mean(ldf.cpu_usage) (f64) | mean(ldf.mem_usage) (f64) |
|---|---|---|---|
| "Center_0" | "East" | 47.197 | 33.030 |
| "Center_0" | "West" | 59.776 | 32.753 |
| "Center_0" | null | 54.916 | 32.860 |
| "Center_1" | "East" | 55.332 | 28.579 |
| "Center_1" | "West" | 51.786 | 34.511 |
| "Center_1" | null | 53.559 | 31.545 |
| null | null | 54.156 | 32.124 |

Partial Sets

duckdb.query('''
    select service_type, center_id, location, mean(columns('.*_usage'))
    from ldf
    group by service_type, rollup(center_id, location)
    order by center_id, location
''').pl()
shape: (21, 5)
| service_type (str) | center_id (str) | location (str) | mean(ldf.cpu_usage) (f64) | mean(ldf.mem_usage) (f64) |
|---|---|---|---|---|
| "Cloud Storage" | "Center_0" | "East" | 48.574 | 34.590 |
| "Data Processing" | "Center_0" | "East" | 38.898 | 31.946 |
| "Web Hosting" | "Center_0" | "East" | 49.511 | 31.492 |
| "Cloud Storage" | "Center_0" | "West" | 59.678 | 39.927 |
| "Data Processing" | "Center_0" | "West" | 54.954 | 21.973 |
| … | … | … | … | … |
| "Web Hosting" | "Center_1" | null | 51.608 | 31.529 |
| "Cloud Storage" | "Center_1" | null | 62.070 | 28.766 |
| "Web Hosting" | null | null | 54.883 | 33.791 |
| "Cloud Storage" | null | null | 58.606 | 32.231 |
| "Data Processing" | null | null | 47.754 | 29.979 |

So far, we have seen three uses for grouping:

  1. arbitrary sets

  2. combinatoric sets (rollup & cube)

  3. partial combinatoric sets

Looking at these applications, I see a hierarchy of needs, where if I can define a function that works with any arbitrary sets, then I should be able to write other functions that can generate those sets and pass them into the primary entry point.

Grouping Sets

Since we’ve identified an important entry point, let’s go ahead and define it in our first pass:

from polars import lit, concat, struct
from polars import selectors as cs

def grouping_sets(df, groupings, exprs):
    frames = []
    for i, gs in enumerate(groupings):
        if not gs:
            # the empty grouping set () is a global aggregation
            query = df.select(exprs)
        else:
            query = df.group_by(gs).agg(exprs)
        frames.append(
            # tag each result with the grouping set that produced it
            query.with_columns(groupings=struct(id=lit(i), members=lit(gs)))
        )
    # diagonal concat aligns the differing column sets, filling gaps with nulls
    return concat(frames, how='diagonal')

grouping_sets(
    ldf,
    groupings=[['center_id', 'location'], ['service_type']],
    exprs=cs.ends_with('usage').mean(),
).collect()
shape: (7, 6)
| center_id (str) | location (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) | service_type (str) |
|---|---|---|---|---|---|
| "Center_1" | "East" | 55.332 | 28.579 | {0,["center_id", "location"]} | null |
| "Center_1" | "West" | 51.786 | 34.511 | {0,["center_id", "location"]} | null |
| "Center_0" | "East" | 47.197 | 33.030 | {0,["center_id", "location"]} | null |
| "Center_0" | "West" | 59.776 | 32.753 | {0,["center_id", "location"]} | null |
| null | null | 58.606 | 32.231 | {1,["service_type"]} | "Cloud Storage" |
| null | null | 54.883 | 33.791 | {1,["service_type"]} | "Web Hosting" |
| null | null | 47.754 | 29.979 | {1,["service_type"]} | "Data Processing" |

This function will do a lot of the heavy lifting, so I suspect it will be called in different ways. Notice that it works with either a polars.DataFrame or a polars.LazyFrame. Additionally, the only requirement for groupings is that it is iterable, and exprs just needs to be a valid Polars expression (polars.Expr). By intentionally relaxing our constraints at this phase, we maximize the flexibility of the arguments.

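As a quick sketch of that flexibility (not from the original post; output omitted), the same function accepts an eager DataFrame and a generator of groupings without modification:

eager_df = ldf.collect()  # an eager polars.DataFrame this time

grouping_sets(
    eager_df,                                                 # DataFrame, not LazyFrame
    groupings=(gs for gs in [['center_id'], ['location']]),  # any iterable works
    exprs=cs.ends_with('usage').mean(),
)
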
Combinatorics

Now let’s shift our attention to generating the groupings that we will feed into the function above. This pattern has the convenience of decoupling our combinatoric iterators from the grouping_sets logic, which keeps the functions pure and simplifies testing them as well.

from itertools import islice, combinations

def cube(*items):
    """powerset of items from largest to smallest
    
    >>> list(cube('a', 'b'))
    [['a', 'b'], ['a'], ['b'], []]
    """
    # the largest subset has size len(items); count down to the empty set
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]

def rollup(*items):
    """produce shrinking subsets of items
    
    >>> list(rollup('a', 'b', 'c', 'd'))
    [['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['c', 'd'], ['d'], []]
    """
    return (
        [*islice(items, i, None)] for i in range(len(items)+1)
    )

grouping_sets(
    ldf,
    groupings=rollup('center_id', 'service_type'),
    exprs=cs.ends_with('usage').mean(),
).collect()
shape: (10, 5)
| center_id (str) | service_type (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) |
|---|---|---|---|---|
| "Center_0" | "Data Processing" | 51.514 | 24.110 | {0,["center_id", "service_type"]} |
| "Center_1" | "Cloud Storage" | 62.070 | 28.766 | {0,["center_id", "service_type"]} |
| "Center_1" | "Web Hosting" | 51.608 | 31.529 | {0,["center_id", "service_type"]} |
| "Center_1" | "Data Processing" | 44.245 | 35.456 | {0,["center_id", "service_type"]} |
| "Center_0" | "Cloud Storage" | 53.756 | 37.080 | {0,["center_id", "service_type"]} |
| "Center_0" | "Web Hosting" | 59.251 | 36.806 | {0,["center_id", "service_type"]} |
| null | "Data Processing" | 47.754 | 29.979 | {1,["service_type"]} |
| null | "Cloud Storage" | 58.606 | 32.231 | {1,["service_type"]} |
| null | "Web Hosting" | 54.883 | 33.791 | {1,["service_type"]} |
| null | null | 54.156 | 32.124 | {2,[]} |

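For completeness, cube slots into the same entry point just as easily (a quick sketch; output omitted here):

grouping_sets(
    ldf,
    groupings=cube('center_id', 'location'),
    exprs=cs.ends_with('usage').mean(),
).collect()
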
Partial Sets

This case is a bit different from the previous ones because it does not cleanly fit into the mechanisms above. This leaves us with two options:

  1. Implement the partial logic into grouping_sets

    • raises the question: how do I toggle it on/off?

  2. Create a separate partial_grouping_sets function specifically for this case

The first option introduces a modality: do we want to pass our input through the partial logic? Can we toggle it? Do we just create a near-parallel/different entry point?

In my opinion, the second option is going to be easier to work with and test; however, we also have a third option: another layer of convenience. If we view partially creating sets as composition alongside the existing rollup/cube implementations, we can write another completely decoupled function to handle this layer of the code.

from itertools import repeat, chain
from collections.abc import Iterator

def partial_sets(*items):
    """combine grouped values alongside rollup, cube, and other iterators
    
    >>> list(partial_sets('center_id', rollup('location', 'service_type')))
    [['center_id', 'location', 'service_type'],
     ['center_id', 'service_type'],
     ['center_id']]
    """
    # non-iterator items are fixed columns: repeat them so they appear in
    # every generated set; zip stops once the shortest iterator is exhausted
    groupings = (
        gs if isinstance(gs, Iterator) else repeat([gs])
        for gs in items
    )
    return (
        [*chain.from_iterable(parts)] for parts in zip(*groupings)
    )

[*partial_sets('center_id', rollup('location', 'service_type'))]
[['center_id', 'location', 'service_type'],
 ['center_id', 'service_type'],
 ['center_id']]

Again, by decoupling the set-generation logic from its evaluation against the polars.LazyFrame, we are making code that is easier to test and maintain.

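Because these generators are pure, we can sanity-check them with plain asserts and no Polars at all; a few illustrative tests (not from the original post):

assert [*cube('a', 'b')] == [['a', 'b'], ['a'], ['b'], []]
assert [*rollup('a', 'b')] == [['a', 'b'], ['b'], []]
assert [*partial_sets('x', rollup('a', 'b'))] == [['x', 'a', 'b'], ['x', 'b'], ['x']]
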
Finally, let’s see this approach in action:

grouping_sets(
    ldf,
    groupings=partial_sets('center_id', rollup('location', 'service_type')),
    exprs=cs.ends_with('usage').mean(),
).collect()
shape: (20, 6)
| center_id (str) | location (str) | service_type (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) |
|---|---|---|---|---|---|
| "Center_0" | "East" | "Data Processing" | 38.898 | 31.946 | {0,["center_id", "location", "service_type"]} |
| "Center_1" | "West" | "Cloud Storage" | 67.406 | 32.279 | {0,["center_id", "location", "service_type"]} |
| "Center_1" | "East" | "Web Hosting" | 53.608 | 27.885 | {0,["center_id", "location", "service_type"]} |
| "Center_0" | "West" | "Data Processing" | 54.954 | 21.973 | {0,["center_id", "location", "service_type"]} |
| "Center_1" | "West" | "Data Processing" | 40.088 | 35.195 | {0,["center_id", "location", "service_type"]} |
| … | … | … | … | … | … |
| "Center_0" | null | "Web Hosting" | 59.251 | 36.806 | {1,["center_id", "service_type"]} |
| "Center_0" | null | "Data Processing" | 51.514 | 24.110 | {1,["center_id", "service_type"]} |
| "Center_0" | null | "Cloud Storage" | 53.756 | 37.080 | {1,["center_id", "service_type"]} |
| "Center_1" | null | null | 53.559 | 31.545 | {2,["center_id"]} |
| "Center_0" | null | null | 54.916 | 32.860 | {2,["center_id"]} |

While this works as a flexible (and easily testable) solution to the given problem, it is not as ergonomic as it could be, especially compared to the method-chaining style of Polars code today. Thankfully, we should be able to take the flexible utilities we created above and massage them into an ergonomic abstraction that our Polars users can access seamlessly.

Ergonomics

Now that we have our flexible layer, let’s continue the discussion and take a look at ergonomics.

To implement the ideas explored above more fluently (which is a silly term, but we’ll go with it), Polars offers a namespace registration mechanism. This mechanism essentially monkey-patches the class, allowing us to dynamically add attributes at runtime.

from polars import LazyFrame, lit, concat, struct
from polars.api import register_lazyframe_namespace
from collections.abc import Iterator

@register_lazyframe_namespace('grouping')
class ExpandedLazyGroupBy:
    def __init__(self, ldf: LazyFrame):
        self._ldf = ldf
        self._contexts = []

    def __call__(self, *groupings, maintain_order=False):
        # if any argument is an iterator (e.g., rollup/cube), coerce the
        # inputs through partial_sets before delegating to .sets
        if any(isinstance(gs, Iterator) for gs in groupings):
            groupings = partial_sets(*groupings)
        return self.sets(*groupings, maintain_order=maintain_order)
    
    def sets(self, *groupings, maintain_order=False):
        contexts = []
        for g in groupings:
            # the empty grouping set is a global aggregation via .select
            ctx = (
                self._ldf.group_by(g, maintain_order=maintain_order).agg
                if g else self._ldf.select
            )
            contexts.append((g, ctx))
        self._contexts = contexts
        return self

    def agg(self, *aggs, **named_aggs):
        frames = []
        for i, (gs, ctx) in enumerate(self._contexts):
            frames.append(
                ctx(*aggs, **named_aggs)
                .with_columns(
                    groupings=struct(id=lit(i), members=lit(gs))
                )
            )
        return concat(frames, how='diagonal_relaxed')

This lets us access our new methods under the LazyFrame.grouping attribute. Each time we access it, we get a fresh instance of ExpandedLazyGroupBy, as the snippet below demonstrates.

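A quick check of that claim (an illustrative snippet, assuming the registration above):

ns_a, ns_b = ldf.grouping, ldf.grouping
assert isinstance(ns_a, ExpandedLazyGroupBy)
assert ns_a is not ns_b  # each attribute access constructs a fresh namespace object
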
This means that we can use this object within typical Polars syntax (method chaining). Additionally, we have rolled the convenience of partial_sets into this interface as the default.

(
    ldf.grouping('center_id', rollup('location'))
    .agg(cs.ends_with('usage').mean())
    .collect()
)
shape: (6, 5)
| center_id (str) | location (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) |
|---|---|---|---|---|
| "Center_0" | "East" | 47.197 | 33.030 | {0,["center_id", "location"]} |
| "Center_0" | "West" | 59.776 | 32.753 | {0,["center_id", "location"]} |
| "Center_1" | "West" | 51.786 | 34.511 | {0,["center_id", "location"]} |
| "Center_1" | "East" | 55.332 | 28.579 | {0,["center_id", "location"]} |
| "Center_0" | null | 54.916 | 32.860 | {1,["center_id"]} |
| "Center_1" | null | 53.559 | 31.545 | {1,["center_id"]} |

However, when including features like our partial_sets convenience, it is often useful to be able to turn them off. I opted for a separate method, .sets, that lets you explicitly pass pre-defined grouping sets. This skips the input-coercion step that partial_sets adds on.

(
    ldf.grouping.sets('center_id', 'location')
    .agg(cs.ends_with('usage').mean())
    .collect()
)
shape: (4, 5)
| center_id (str) | cpu_usage (f64) | mem_usage (f64) | groupings (struct[2]) | location (str) |
|---|---|---|---|---|
| "Center_0" | 54.916 | 32.860 | {0,"center_id"} | null |
| "Center_1" | 53.559 | 31.545 | {0,"center_id"} | null |
| null | 52.259 | 30.260 | {1,"location"} | "East" |
| null | 55.708 | 33.648 | {1,"location"} | "West" |

Wrap-Up

That’s all we have time for this week! Stay tuned next time for some more fun with Polars.

What do you think about flexibility and ergonomics? Let me know on the DUTC Discord. Talk to you all next week!