pandas: Within the Restricted Computation Domain#

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.

Background#

Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.

import pandas as pd

df = pd.DataFrame({
    "a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
    "x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})

def detect_outliers(x, r=1.5):
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (x > q3 + r * iqr) | (x < q1 - r * iqr)

display(
    df.assign(outliers=df.groupby(["a"])["x"].apply(detect_outliers).values)
)
   a       x  outliers
0  A   -10.0      True
1  A     2.1     False
2  B  1001.3     False
3  A     3.1     False
4  B  1002.2     False
5  A     2.7     False
6  B   999.8     False
7  B    -8.0     False
8  A     1.5      True

We want to perform outlier detection on the 'x' column within each group in the 'a' column. In this case, we used a .groupby(…).apply(…) pattern to call a User Defined Function (UDF) for each group.

But before we can talk about Polars, let’s first talk about some pandas concepts and the performance issues we may encounter with the pandas code above.

Restricted Computation Domains#

A restricted computation domain can be loosely thought of as a space or tool that imposes some set of rules. By abiding by the rules of that space, we see massive gains in the performance of our code. The classical example of this is NumPy:

import numpy as np
from numpy.random import default_rng

rng = default_rng(0)
np_xs = rng.integers(100, size=100_000)
py_xs = [int(x) for x in np_xs]

%timeit -n1 -r1 sum(np_xs)     # Python sum on Numpy values
%timeit -n1 -r1 np.sum(py_xs)  # NumPy sum on Python values
%timeit -n1 -r1 sum(py_xs)     # Python sum on Python values
%timeit -n1 -r1 np.sum(np_xs)  # NumPy sum on NumPy values
3.77 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.87 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
387 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
62.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Right away, we can see that the fastest timing above came from np.sum(np_xs). One might expect this, since we are performing numeric operations via NumPy (Numeric Python): it’s in the name! What may be surprising, though, is that np.sum(py_xs) is significantly slower than even the pure-Python sum(py_xs). So, what’s the issue here?

The issue is that we crossed the boundary of our Restricted Computation Domain. We took a tool, np.sum, and applied it to something it cannot operate on directly (a Python list), so the entire list first had to be converted into an array before any summing could happen. If we want to write performant code, we need to work entirely within a domain and cross its boundary as rarely as possible.
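One practical consequence: if your data starts its life as Python objects, convert it once at the boundary and keep every subsequent operation inside NumPy. A minimal sketch (continuing with the py_xs list and the np import from above; the variable names are mine):

converted = np.asarray(py_xs)               # cross the boundary once, up front
total = np.sum(converted)                   # this stays entirely inside the NumPy domain
total_of_squares = np.sum(converted ** 2)   # and so does every later operation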

The pandas Domain#

Extending the idea of the restricted computation domain, we can think that pandas also operates with its own set of rules that let us write performant code:

  1. Avoid object dtypes in your Series

  2. Avoid iterating over the rows

  3. Avoid user-defined functions in groupby operations
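To make the first two rules concrete, here is a small sketch (timings will vary by machine): an object-dtype Series falls back to Python-level addition, and iterating over rows is dramatically slower than whole-column arithmetic.

import pandas as pd

s_obj = pd.Series(range(100_000), dtype=object)  # rule 1: object dtype forces Python-level work
s_num = s_obj.astype('int64')                    # a numeric dtype stays inside the domain

%timeit -n1 -r1 s_obj.sum()
%timeit -n1 -r1 s_num.sum()

small = pd.DataFrame({'x': range(10_000), 'y': range(10_000)})
%timeit -n1 -r1 [row['x'] + row['y'] for _, row in small.iterrows()]  # rule 2: row iteration
%timeit -n1 -r1 small['x'] + small['y']                               # whole-column arithmetic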

The thing these rules have in common is that they all provide guidance to keep us inside the pandas restricted computation domain. The problem at hand speaks directly to rule three above. Let’s revisit the code from the beginning of this post, with one small correction: instead of tacking .values onto the grouped result (which assigns the flags positionally and puts True on the wrong rows; compare rows 7 and 8 in the two outputs), we use .droplevel(0) so the result aligns with the DataFrame’s original index.

import pandas as pd

df = pd.DataFrame({
    "a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
    "x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})

def detect_outliers(x, r=1.5):
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (x > q3 + r * iqr) | (x < q1 - r * iqr)

display(
    #  We break rule 3. below:
    #                                        #######################
    df.assign(outliers=df.groupby(["a"])["x"].apply(detect_outliers).droplevel(0))
)
   a       x  outliers
0  A   -10.0      True
1  A     2.1     False
2  B  1001.3     False
3  A     3.1     False
4  B  1002.2     False
5  A     2.7     False
6  B   999.8     False
7  B    -8.0      True
8  A     1.5     False

Before I get to the price we’re paying by breaking this rule, let’s first talk about the rules that govern groupby(…).apply in pandas. When passing a user-defined function to groupby(…).apply, pandas will locate each of the groups, then iterate over those groups (at the Python level) and evaluate the function call on each of those groups.

This is an incredibly flexible pattern, but it inherently crosses the domain boundary because we are now ① iterating at the Python level and ② evaluating a Python function call once per group.

If we want to stay within the domain, we should ① leverage the appropriate groupby verbs and ② avoid passing UDFs. For the verb, we have three choices: aggregate, transform, and apply. The briefest summary of these choices: use groupby(…).aggregate to return one value per group, groupby(…).transform to return a result the same size as its input group, and groupby(…).apply to operate on multiple columns within the same grouped operation or when you’re unsure of the size of the result. Because it is the most flexible, groupby(…).apply can be noticeably slower than its more specific counterparts.
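As a quick illustration of these verbs (using the same df as above and a built-in 'mean' purely for demonstration), note how aggregate collapses each group to a single row while transform broadcasts the per-group result back to the original shape:

grouped = df.groupby('a')['x']

grouped.aggregate('mean')              # one value per group ('A' and 'B')
grouped.transform('mean')              # the per-group mean repeated to match df's 9 rows
grouped.apply(lambda s: s - s.mean())  # the most flexible, but calls a Python function per group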

pandas improvement#

For this case, the only grouped operation we want to evaluate is the quantile calculation. If we transform this operation back to its original shape, then we can uniformly evaluate the algebraic transformations and comparisons irrespective of the groups.

This means we should favor .groupby(…).transform and evaluate the quantiles as early as possible. The resultant code might look like this:

# Not a UDF in sight!
grouped = df.groupby(['a'])
q1  = grouped['x'].transform('quantile', .25)
q3  = grouped['x'].transform('quantile', .75)
iqr = q3 - q1
r   = 1.5
df.assign(
    outliers=(df['x'] > q3 + (r * iqr)) | (df['x'] < q1 - (r * iqr))
)
   a       x  outliers
0  A   -10.0      True
1  A     2.1     False
2  B  1001.3     False
3  A     3.1     False
4  B  1002.2     False
5  A     2.7     False
6  B   999.8     False
7  B    -8.0      True
8  A     1.5     False

Do note that we can still use UDFs as structuring mechanisms, but we should avoid passing UDFs into groupby operations.
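For example, here is one way to do that (a sketch; the helper name detect_outliers_grouped is mine): the vectorized logic lives inside a reusable, parameterized function, but that function is never handed to groupby.

def detect_outliers_grouped(frame, by, column, r=1.5):
    # the UDF structures the code; only built-in verbs are passed into groupby
    grouped = frame.groupby(by)[column]
    q1 = grouped.transform('quantile', 0.25)
    q3 = grouped.transform('quantile', 0.75)
    iqr = q3 - q1
    return (frame[column] > q3 + r * iqr) | (frame[column] < q1 - r * iqr)

df.assign(outliers=detect_outliers_grouped(df, 'a', 'x'))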

from string import ascii_lowercase
from time import perf_counter
from contextlib import contextmanager

@contextmanager
def timed(msg=''):
    start = perf_counter()
    yield
    stop = perf_counter()
    print(f'{msg:<24}elapsed {stop-start:.6f}s')

rng = default_rng(0)
size = 10_000
large_df = pd.DataFrame({
    "a": rng.choice([*ascii_lowercase], size=(size, 2)).view('<U2').ravel(),
    "x": rng.integers(100, size=size),
})

I’ll use the terms 'wrapped' and 'unwrapped' below to refer to the version that passes the UDF into the groupby operation and the non-UDF version of the code, respectively.

with timed('wrapped'):
    wrapped_outliers = (
        #                            ######### using the correct verb
        large_df.groupby(["a"])["x"].transform(detect_outliers)
    )

with timed('unwrapped'):
    grouped = large_df.groupby(['a'])
    q1  = grouped['x'].transform('quantile', .25)
    q3  = grouped['x'].transform('quantile', .75)
    iqr = q3 - q1
    r   = 1.5
    unwrapped_outliers = (
        (large_df['x'] > q3 + (r * iqr)) | (large_df['x'] < q1 - (r * iqr))
    )

assert (wrapped_outliers == unwrapped_outliers).all()
wrapped                 elapsed 0.390974s
unwrapped               elapsed 0.004204s

Whew, look at that performance gain! Our “unwrapped” version is much faster than its UDF counterpart.

The Polars Domain#

Now I want to continue our discussion of Restricted Computation Domains and share my thoughts on the interface that Polars brings to the table, comparing it against its NumPy/pandas counterparts.

Polars is another popular Python DataFrame library which, in contrast to pandas, does not use NumPy as its backend. Instead of exposing array interfaces to the end user, Polars has opted to let users compose the work to be done and hand it to the domain for evaluation. This form of API nearly forces users to abide by the rules of the domain because crossing the domain boundary becomes much trickier. The most important rule of Polars is…

Only use the Expression interface: if you abide by this rule then you can usually assume that your code will be performant.

Before we discuss expression syntax, let’s take a look at how easily we can write non-performant code in NumPy. We’ll perform the same operation two ways:

  1. We mistakenly cross the domain boundary repeatedly, and it hurts our performance

  2. We abide correctly within NumPy’s domain

orig_xs = rng.uniform(size=(100_000, 10))

# 1. crosses the domain repeatedly due to Python iteration
xs = orig_xs.copy()
with timed('Python iteration'):
    for i, row in enumerate(xs):
        xs[i, :] = row * 100
    
# 2. no boundary crossing
xs = orig_xs.copy()
with timed('NumPy RCD'):
    xs *= 100
Python iteration        elapsed 0.161434s
NumPy RCD               elapsed 0.000507s

Unless you are familiar with NumPy, the code in section one above looks quite innocuous. Let’s contrast this with a Polars equivalent where, first, we intentionally cross the RCD boundary and, second, we remain within the domain.

from polars import DataFrame as pl_DataFrame, all as pl_all

df = pl_DataFrame(orig_xs)

with timed('Python iteration'):
    new_rows = []
    for row in df.iter_rows():
        new_rows.append([r * 100 for r in row])
    result1 = pl_DataFrame(new_rows, schema=df.schema, orient='row')
    
with timed('Polars RCD'):
    result2 = df.select(pl_all() * 100)

result1.equals(result2)
Python iteration        elapsed 0.196585s
Polars RCD              elapsed 0.004305s
True

While this example is a touch exaggerated, it highlights an important point that I like about Polars: it is tedious for us to cross the domain boundary. I need to use numerous additional methods and mechanics just to iterate over the rows of the DataFrame and create a new one. By contrast, the expression syntax exists as a first-class citizen in Polars code.
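To see what that looks like in practice, remember that an expression is just a Python object describing work to be done; it can be built, named, and reused before any data is touched. A minimal sketch, using the column names from the frame above:

from polars import col

scaled = col('column_1') * 100          # an expression: no computation has happened yet
df.select(scaled, big=scaled > 50)      # the work runs only once the expression enters the domain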

But that’s not all there is to the Polars restricted computation domain. By providing an interface for users to construct queries to pass into the domain (instead of exposing a direct array interface underneath), Polars can apply optimizations such as…

  1. caching redundant expressions

  2. reordering operations into a more efficient plan

We can enable these optimizations simply by building a query (instead of immediately evaluating it) via the LazyFrame interface.

from polars import col
from re import sub

ldf = df.lazy()

query = (
    ldf
    .filter(col('column_1') > .5)
    .select(
        x=col('column_2') * 100,
        y=col('column_2') * 100 > 30,
    )
)

print(
    '--Unoptimized--',
    query.explain(optimized=False),
    
    '\N{box drawings light horizontal}' * 80,
    
    '--Optimized--',
    sub(
        r'SELECT \[(?P<cols>.+)\], (?P<CSE>CSE = \[.+\]) FROM\s+(?P<table>.+)', 
        r'SELECT \n    [\1]\n    [\2]\n FROM \3',
        query.explain(optimized=True)),
    sep='\n',
)
--Unoptimized--
 SELECT [[(col("column_2")) * (100.0)].alias("x"), [([(col("column_2")) * (100.0)]) > (30.0)].alias("y")] FROM
  FILTER [(col("column_1")) > (0.5)] FROM
    DF ["column_0", "column_1", "column_2", "column_3"]; PROJECT */10 COLUMNS; SELECTION: None
────────────────────────────────────────────────────────────────────────────────
--Optimized--
 SELECT [col("__POLARS_CSER_0x311ec9e088386347").alias("x"), [(col("__POLARS_CSER_0x311ec9e088386347")) > (30.0)].alias("y")] FROM
   WITH_COLUMNS:
   [[(col("column_2")) * (100.0)].alias("__POLARS_CSER_0x311ec9e088386347")] 
    DF ["column_0", "column_1", "column_2", "column_3"]; PROJECT 2/10 COLUMNS; SELECTION: [(col("column_1")) > (0.5)]

The above query plans can be read similarly to SQL, but even if you’re unfamiliar with the syntax here, take a look at the differences between the two queries.

First, the unoptimized query contains a FILTER operation that is simply not present in the optimized query. That is because the filter has been pushed down to the point where the data is read in from the source (known as “predicate pushdown”), which means we can perform multiple operations in a single pass over the data (something we cannot easily do in NumPy). Second, we can see a new column in the optimized plan that is not present in the output: '__POLARS_CSER_...'. This is common subexpression elimination: Polars detects that we asked for the same operation twice ('column_2 * 100'), caches the result of that operation, and reuses it in multiple places instead of evaluating it multiple times.

These optimizations are possible because Polars’ preferred API provides building blocks for passing whole computations into the domain, as opposed to the NumPy approach of exposing the low-level data directly to the user.
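And note that up to this point we have only described the work; nothing has actually run. A brief follow-up, continuing from the query built above:

result = query.collect()   # the optimized plan is executed only at this point
result.head()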

Translating pandas → Polars#

To bring us full circle (as Henry from our DUTC Discord server pointed out), we still need to address the inspiration for this blog post. Recall that we began with an outlier detection problem in pandas. Remember that the way to stay in the Polars Restricted Computation Domain is to:

  1. Always use expressions

So with that in mind, we should be able to translate the following outlier detection function

# the unoptimized pandas outlier detection function
def detect_outliers(x, r=1.5):
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (x > q3 + r * iqr) | (x < q1 - r * iqr)

so that it readily works with Polars. To do this, we can create a function that takes in an expression and produces a new expression. This decouples the function from the underlying data it is computed against, which lets Polars make any relevant optimizations to the final query plan.

This preserves the convenience of user-defined functions that bundle complex expressions into reusable, parameterized units.

Also take a good look at how closely the Polars function below matches the original pandas implementation above! We just need to change a single line: q1, q3 = ….quantile([.25, .75]) becomes q1, q3 = ….quantile(.25), ….quantile(.75), since a Polars expression’s .quantile takes one quantile at a time. When we apply the expression, .over('a') scopes it to each group in 'a', playing the role that groupby(…).transform played in pandas.

def detect_outliers(expr, r=1.5):
    q1, q3 = expr.quantile(0.25), expr.quantile(0.75)
    iqr = q3 - q1
    return (expr > q3 + r * iqr) | (expr < q1 - r * iqr)

type(detect_outliers(col('x'))) # This returns a new expression!
polars.expr.expr.Expr

from polars import DataFrame as pl_DataFrame, col

df = pl_DataFrame({
    "a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
    "x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})

df.with_columns(
    outliers=detect_outliers(col('x')).over('a')
)
shape: (9, 3)
┌─────┬────────┬──────────┐
│ a   ┆ x      ┆ outliers │
│ --- ┆ ---    ┆ ---      │
│ str ┆ f64    ┆ bool     │
╞═════╪════════╪══════════╡
│ A   ┆ -10.0  ┆ true     │
│ A   ┆ 2.1    ┆ false    │
│ B   ┆ 1001.3 ┆ false    │
│ A   ┆ 3.1    ┆ false    │
│ B   ┆ 1002.2 ┆ false    │
│ A   ┆ 2.7    ┆ false    │
│ B   ┆ 999.8  ┆ false    │
│ B   ┆ -8.0   ┆ true     │
│ A   ┆ 1.5    ┆ false    │
└─────┴────────┴──────────┘

Wrap-Up#

Whew! That’s a lot of discussion of restricted computation domains. While Polars is a shiny new tool, NumPy/pandas are still incredibly flexible workhorses. The benefits of this flexibility simply come at the price of knowing which operations will cross the domain boundary and which will not. Once you develop this intuition, you’ll be able to write more efficient code than ever before.

What do you think about these tools? Which would you use in your code? Something not making sense? Let me know on the DUTC Discord server.

Talk to you all next week!