pandas: Within the Restricted Computation Domain#

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.

Background#

Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.

import pandas as pd

df = pd.DataFrame({
    "a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
    "x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})

def detect_outliers(x, r=1.5):
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (x > q3 + r * iqr) | (x < q1 - r * iqr)

display(
    df.assign(outliers=df.groupby(["a"])["x"].apply(detect_outliers).values)
)
a x outliers
0 A -10.0 True
1 A 2.1 False
2 B 1001.3 False
3 A 3.1 False
4 B 1002.2 False
5 A 2.7 False
6 B 999.8 False
7 B -8.0 False
8 A 1.5 True

We want to perform outlier detection on the 'x' column within each group in the 'a' column. In this case, we used a .groupby(…).apply(…) pattern to call a User Defined Function (UDF) for each group.

But before we can talk about Polars, let’s first talk about some pandas concepts and the performance issues we may encounter with the pandas code above.

Restricted Computation Domains#

A restricted computation domain can be loosely thought of as a space or tool that imposes some set of rules. By abiding by the rules of that space, we see massive gains in the performance of our code. The classical example of this is NumPy:

import numpy as np
from numpy.random import default_rng

rng = default_rng(0)
np_xs = rng.integers(100, size=100_000)
py_xs = [int(x) for x in np_xs]

%timeit -n1 -r1 sum(np_xs)     # Python sum on Numpy values
%timeit -n1 -r1 np.sum(py_xs)  # NumPy sum on Python values
%timeit -n1 -r1 sum(py_xs)     # Python sum on Python values
%timeit -n1 -r1 np.sum(np_xs)  # NumPy sum on NumPy values
8.07 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
7.27 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
962 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
97.3 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Right away, we can see that the fastest timing in the above was numpy.sum(np_xs). One may expect this as we are performing numeric operations via NumPy (Numeric Python)—it’s in the name! But what may be surprising is that the np.sum(py_xs) is a significantly worse benchmark. So, what’s the issue here?

The issue is that we crossed the boundary of our Restricted Computation Domain. We took a tool np.sum and applied it to something it cannot readily work with (a Python list). So, if we want to write performant code, we need to work entirely within a domain and avoid crossing domain boundaries as much as possible.

The pandas Domain#

Extending the idea of the restricted computation domain, we can think that pandas also operates with its own set of rules that let us write performant code:

  1. Avoid object dtypes in your Series

  2. Avoid iterating over the rows

  3. Avoid user-defined functions in groupby operations

The thing that these rules have in common is that they all provide guidance for us to stay in a pandas restricted computation domain. The problem we have at hand speaks directly to rule three, above. Let’s revisit the code we saw at the beginning of this post.

import pandas as pd

df = pd.DataFrame({
    "a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
    "x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})

def detect_outliers(x, r=1.5):
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (x > q3 + r * iqr) | (x < q1 - r * iqr)

display(
    #  We break rule 3. below:
    #                                        #######################
    df.assign(outliers=df.groupby(["a"])["x"].apply(detect_outliers).values)
)
a x outliers
0 A -10.0 True
1 A 2.1 False
2 B 1001.3 False
3 A 3.1 False
4 B 1002.2 False
5 A 2.7 False
6 B 999.8 False
7 B -8.0 False
8 A 1.5 True

Before I get to the price we’re paying by breaking this rule, let’s first talk about the rules that govern groupby(…).apply in pandas. When passing a user-defined function to groupby(…).apply, pandas will locate each of the groups, then iterate over those groups (at the Python level) and evaluate the function call on each of those groups.

This is an incredibly flexible pattern, but it inherently crosses the domain boundary because we are now ① iterating at the Python level ② evaluating a Python function call once per group.

If we want try and stay within the domain, we should ① leverage appropriate groupby verbs and ② avoid passing UDFs. To leverage the appropriate verb in a groupby operation, we have the following choices: aggregate, transform, and apply. The briefest summary of these choices is to use groupby(…).aggregate to return one value per group, groupby(…).transform to receive a result whose output is the same size as the inputted group, and groupby(…).apply to operate with multiple columns within the same grouped operation or when you’re unsure of the size of the result. In fact, groupby(…).apply can be slower than its more specific counterparts.

pandas improvement#

For this case, the only grouped operation we want to evaluate is the quantile calculation. If we transform this operation back to its original shape, then we can uniformly evaluate the algebraic transformations and comparisons irrespective of the groups.

This means we should favor .groupby(…).transform and evaluate the quantiles as early as possible. The resultant code might look like this:

# Not a UDF in sight!
grouped = df.groupby(['a'])
q1  = grouped['x'].transform('quantile', .25)
q3  = grouped['x'].transform('quantile', .75)
iqr = q3 - q1
r   = 1.5
df.assign(
    outliers=(df['x'] > q3 + (r * iqr)) | (df['x'] < q1 - (r * iqr))
)
a x outliers
0 A -10.0 True
1 A 2.1 False
2 B 1001.3 False
3 A 3.1 False
4 B 1002.2 False
5 A 2.7 False
6 B 999.8 False
7 B -8.0 True
8 A 1.5 False

Do note that we can still use UDFs as structuring mechanisms, but we should avoid passing UDFs into groupby operations.

from string import ascii_lowercase
from time import perf_counter
from contextlib import contextmanager

@contextmanager
def timed(msg=''):
    start = perf_counter()
    yield
    stop = perf_counter()
    print(f'{msg:<24}elapsed {stop-start:.6f}s')

rng = default_rng(0)
size = 10_000
large_df = pd.DataFrame({
    "a": rng.choice([*ascii_lowercase], size=(size, 2)).view('<U2').ravel(),
    "x": rng.integers(100, size=size),
})

I’ll use the terms 'wrapped' and 'unwrapped' below to refer to the groupby(…).apply version of the code and the ‘Non UDF’ version of the code respectively.

with timed('wrapped'):
    wrapped_outliers = (
        #                            ######### using the correct verb
        large_df.groupby(["a"])["x"].transform(detect_outliers)
    )

with timed('unwrapped'):
    grouped = large_df.groupby(['a'])
    q1  = grouped['x'].transform('quantile', .25)
    q3  = grouped['x'].transform('quantile', .75)
    iqr = q3 - q1
    r   = 1.5
    unwrapped_outliers = (
        (large_df['x'] > q3 + (r * iqr)) | (large_df['x'] < q1 - (r * iqr))
    )

assert (wrapped_outliers == unwrapped_outliers).all()
wrapped                 elapsed 0.900374s
unwrapped               elapsed 0.006418s

Whew, look at that performance gain! Our “unwrapped” version is much faster than its UDF counterpart.

The Polars Domain#

Welcome back to this week’s Cameron’s Corner! I want to continue our discussion of Restricted Computation Domains and share my thoughts on the interface that Polars brings to the table, while also comparing it against NumPy/pandas counterparts.

Polars is another popular Python DataFrame library, which—in contrast to pandas—does not use NumPy in its backend. Instead of exposing array interfaces to the end-user, Polars has opted to provide users with the ability to compose and evaluate the work that is to be done in the domain. This form of API nearly forces the users to abide by the rules of the domain because it becomes trickier to avoid crossing the domain boundary. The most important rule of Polars is…

Only use the Expression interface: if you abide by this rule then you can usually assume that your code will be performant.

Before we discuss expression syntax, let’s take a look at how easily we can write non-performant code in NumPy. If we apply the same operation twice…

  1. We mistakenly cross the domain boundary repeatedly and it impacts our performance

  2. We abide correctly within NumPy’s domain

orig_xs = rng.uniform(size=(100_000, 10))

# 1. crosses the domain repeatedly due to Python iteration
xs = orig_xs.copy()
with timed('Python iteration'):
    for i, row in enumerate(xs):
        xs[i, :] = row * 100
    
# 2. no boundary crossing
xs = orig_xs.copy()
with timed('NumPy RCD'):
    xs *= 100
Python iteration        elapsed 0.320450s
NumPy RCD               elapsed 0.001285s

Unless you are familiar with NumPy, I think the above code in section one is quite auspicious. Let’s contrast this to a Polars equivalent where first, we intentionally cross the RCD boundary, and second, we remain within the domain.

from polars import DataFrame as pl_DataFrame, all as pl_all

df = pl_DataFrame(orig_xs)

with timed('Python iteration'):
    new_rows = []
    for row in df.iter_rows():
        new_rows.append([r * 100 for r in row])
    result1 = pl_DataFrame(new_rows, schema=df.schema)
    
with timed('Polars RCD'):
    result2 = df.select(pl_all() * 100)

result1.equals(result2)
Python iteration        elapsed 0.468501s
Polars RCD              elapsed 0.014243s
True

While this example is a touch exaggerated, it highlights an important point that I like about Polars: it is tedious for us to cross the domain boundary. I need to use numerous additional methods and mechanics just to iterate over the rows of the DataFrame and create a new one. By contrast, the expression syntax exists as a first-class citizen in Polars code.

But that’s not all there is to the Polars restricted computation domain. In fact, by providing an interface for users to construct queries to pass into the domain (instead of exposing an underlying direct array interface), Polars can optimize…

  1. caching redundant expressions

  2. reordering expressions to a more optimal solution

We can enable these optimizations simply by building a query (instead of immediately evaluating it) via the LazyFrame interface.

from polars import col
from IPython.display import display, Markdown
from re import sub

ldf = df.lazy()

query = (
    df.lazy()
    .filter(col('column_1') > .5)
    .select(
        x=col('column_2') * 100,
        y=col('column_2') * 100 > 30,
    )
)

# display(Markdown('## Unoptimized'))
# print(query.explain(optimized=False))

print(
    '--Unoptimized--',
    query.explain(optimized=False),
    
    '\N{box drawings light horizontal}' * 80,
    
    '--Optimized--',
    sub(
        r'SELECT \[(?P<cols>.+)\], (?P<CSE>CSE = \[.+\]) FROM\s+(?P<table>.+)', 
        r'SELECT \n    [\1]\n    [\2]\n FROM \3',
        query.explain(optimized=True)),
    sep='\n',
)
--Unoptimized--
 SELECT [[(col("column_2")) * (100.0)].alias("x"), [([(col("column_2")) * (100.0)]) > (30.0)].alias("y")] FROM
  FILTER [(col("column_1")) > (0.5)] FROM
    DF ["column_0", "column_1", "column_2", "column_3"]; PROJECT */10 COLUMNS; SELECTION: None
────────────────────────────────────────────────────────────────────────────────
--Optimized--
 SELECT 
    [col("__POLARS_CSER_0xb14fcd343b0e91fc").alias("x"), [(col("__POLARS_CSER_0xb14fcd343b0e91fc")) > (30.0)].alias("y")]
    [CSE = [[(col("column_2")) * (100.0)].alias("__POLARS_CSER_0xb14fcd343b0e91fc")]]
 FROM DF ["column_0", "column_1", "column_2", "column_3"]; PROJECT 2/10 COLUMNS; SELECTION: [(col("column_1")) > (0.5)]

The above query plans can be read similarly to SQL, but even if you’re unfamiliar with the syntax here, take a look at the differences between the two queries.

First, the unoptimized query contains a FILTER operation that is simply not present in the optimized query. This is because the filter operation is pushed down to point to where it’s applied as the data is read in from the source (known as “predicate pushdown”). This means we can perform multiple operations in a single pass on the data (something that we cannot easily do in NumPy). Second, we can see that there is a new column that is not present in the output: '__POLARS_CSER_...'. This refers to “Common Sub-Plan Elimination”, where Polars detects that we wanted to perform the same operation twice ('column_2 * 100') and decides to cache the result to that operation and reuse it in multiple places instead of evaluating it multiple times.

These optimizations are something unique to Polars since its preferred API provides building blocks to pass complex computations into the domain, as opposed to the NumPy approach which exposes low-level data to the user.

Wrap-Up#

Whew! That’s a lot of discussion of restricted computation domains. While Polars is a shiny new tool, NumPy/pandas are still incredibly flexible workhorses. The benefits of this flexibility simply come at the price of knowing which operations will cross the domain boundary and which will not. Once you develop this intuition, you’ll be able to write more efficient code than ever before.

What do you think about these tools? Which would you use in your code? Something not making sense? Let me know on the DUTC Discord server.

Talk to you all next week!