pandas: Within the Restricted Computation Domain#
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.
Background#
Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.
import pandas as pd
df = pd.DataFrame({
"a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
"x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})
def detect_outliers(x, r=1.5):
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
return (x > q3 + r * iqr) | (x < q1 - r * iqr)
display(
df.assign(outliers=df.groupby(["a"])["x"].apply(detect_outliers).values)
)
a | x | outliers | |
---|---|---|---|
0 | A | -10.0 | True |
1 | A | 2.1 | False |
2 | B | 1001.3 | False |
3 | A | 3.1 | False |
4 | B | 1002.2 | False |
5 | A | 2.7 | False |
6 | B | 999.8 | False |
7 | B | -8.0 | False |
8 | A | 1.5 | True |
We want to perform outlier detection on the 'x'
column within each group in the
'a'
column. In this case, we used a .groupby(…).apply(…)
pattern to call a
User Defined Function (UDF) for each group.
But before we can talk about Polars, let’s first talk about some pandas concepts and the performance issues we may encounter with the pandas code above.
Restricted Computation Domains#
A restricted computation domain can be loosely thought of as a space or tool that imposes some set of rules. By abiding by the rules of that space, we see massive gains in the performance of our code. The classical example of this is NumPy:
import numpy as np
from numpy.random import default_rng
rng = default_rng(0)
np_xs = rng.integers(100, size=100_000)
py_xs = [int(x) for x in np_xs]
%timeit -n1 -r1 sum(np_xs) # Python sum on Numpy values
%timeit -n1 -r1 np.sum(py_xs) # NumPy sum on Python values
%timeit -n1 -r1 sum(py_xs) # Python sum on Python values
%timeit -n1 -r1 np.sum(np_xs) # NumPy sum on NumPy values
8.07 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
7.27 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
962 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
97.3 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Right away, we can see that the fastest timing in the above was numpy.sum(np_xs)
.
One may expect this as we are performing numeric operations via NumPy (Numeric Python)—it’s in the name! But what may be surprising is that the np.sum(py_xs)
is a significantly
worse benchmark. So, what’s the issue here?
The issue is that we crossed the boundary of our Restricted Computation Domain.
We took a tool np.sum
and applied it to something it cannot readily work with
(a Python list). So, if we want to write performant code, we need to work entirely
within a domain and avoid crossing domain boundaries as much as possible.
The pandas Domain#
Extending the idea of the restricted computation domain, we can think that pandas also operates with its own set of rules that let us write performant code:
Avoid object dtypes in your Series
Avoid iterating over the rows
Avoid user-defined functions in groupby operations
The thing that these rules have in common is that they all provide guidance for us to stay in a pandas restricted computation domain. The problem we have at hand speaks directly to rule three, above. Let’s revisit the code we saw at the beginning of this post.
import pandas as pd
df = pd.DataFrame({
"a": ["A", "A", "B", "A", "B", "A", "B", "B", "A"],
"x": [-10.0, 2.1, 1001.3, 3.1, 1002.2, 2.7, 999.8, -8.0, 1.5],
})
def detect_outliers(x, r=1.5):
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
return (x > q3 + r * iqr) | (x < q1 - r * iqr)
display(
# We break rule 3. below:
# #######################
df.assign(outliers=df.groupby(["a"])["x"].apply(detect_outliers).values)
)
a | x | outliers | |
---|---|---|---|
0 | A | -10.0 | True |
1 | A | 2.1 | False |
2 | B | 1001.3 | False |
3 | A | 3.1 | False |
4 | B | 1002.2 | False |
5 | A | 2.7 | False |
6 | B | 999.8 | False |
7 | B | -8.0 | False |
8 | A | 1.5 | True |
Before I get to the price we’re paying by breaking this rule, let’s first talk about
the rules that govern groupby(…).apply
in pandas. When passing a user-defined
function to groupby(…).apply
, pandas will locate each of the groups, then
iterate over those groups (at the Python level) and evaluate the function call
on each of those groups.
This is an incredibly flexible pattern, but it inherently crosses the domain boundary because we are now ① iterating at the Python level ② evaluating a Python function call once per group.
If we want try and stay within the domain, we should ① leverage appropriate groupby
verbs
and ② avoid passing UDFs. To leverage the appropriate verb in a groupby
operation,
we have the following choices: aggregate, transform, and apply. The briefest summary of these choices is to use groupby(…).aggregate
to return one value per group, groupby(…).transform
to receive a result whose output is the same size as the inputted group, and
groupby(…).apply
to
operate with multiple columns within the same grouped operation or when you’re unsure of the size of the result. In fact,
groupby(…).apply can be slower than its more specific counterparts.
pandas improvement#
For this case, the only grouped operation we want to evaluate is the quantile calculation. If we transform this operation back to its original shape, then we can uniformly evaluate the algebraic transformations and comparisons irrespective of the groups.
This means we should favor .groupby(…).transform
and evaluate the quantiles
as early as possible. The resultant code might look like this:
# Not a UDF in sight!
grouped = df.groupby(['a'])
q1 = grouped['x'].transform('quantile', .25)
q3 = grouped['x'].transform('quantile', .75)
iqr = q3 - q1
r = 1.5
df.assign(
outliers=(df['x'] > q3 + (r * iqr)) | (df['x'] < q1 - (r * iqr))
)
a | x | outliers | |
---|---|---|---|
0 | A | -10.0 | True |
1 | A | 2.1 | False |
2 | B | 1001.3 | False |
3 | A | 3.1 | False |
4 | B | 1002.2 | False |
5 | A | 2.7 | False |
6 | B | 999.8 | False |
7 | B | -8.0 | True |
8 | A | 1.5 | False |
Do note that we can still use UDFs as structuring mechanisms, but we should
avoid passing UDFs into groupby
operations.
from string import ascii_lowercase
from time import perf_counter
from contextlib import contextmanager
@contextmanager
def timed(msg=''):
start = perf_counter()
yield
stop = perf_counter()
print(f'{msg:<24}elapsed {stop-start:.6f}s')
rng = default_rng(0)
size = 10_000
large_df = pd.DataFrame({
"a": rng.choice([*ascii_lowercase], size=(size, 2)).view('<U2').ravel(),
"x": rng.integers(100, size=size),
})
I’ll use the terms 'wrapped'
and 'unwrapped'
below to refer to the
groupby(…).apply
version of the code and the ‘Non UDF’ version of the code
respectively.
with timed('wrapped'):
wrapped_outliers = (
# ######### using the correct verb
large_df.groupby(["a"])["x"].transform(detect_outliers)
)
with timed('unwrapped'):
grouped = large_df.groupby(['a'])
q1 = grouped['x'].transform('quantile', .25)
q3 = grouped['x'].transform('quantile', .75)
iqr = q3 - q1
r = 1.5
unwrapped_outliers = (
(large_df['x'] > q3 + (r * iqr)) | (large_df['x'] < q1 - (r * iqr))
)
assert (wrapped_outliers == unwrapped_outliers).all()
wrapped elapsed 0.900374s
unwrapped elapsed 0.006418s
Whew, look at that performance gain! Our “unwrapped” version is much faster than its UDF counterpart.
The Polars Domain#
Welcome back to this week’s Cameron’s Corner! I want to continue our discussion of Restricted Computation Domains and share my thoughts on the interface that Polars brings to the table, while also comparing it against NumPy/pandas counterparts.
Polars is another popular Python DataFrame library, which—in contrast to pandas—does not use NumPy in its backend. Instead of exposing array interfaces to the end-user, Polars has opted to provide users with the ability to compose and evaluate the work that is to be done in the domain. This form of API nearly forces the users to abide by the rules of the domain because it becomes trickier to avoid crossing the domain boundary. The most important rule of Polars is…
Only use the Expression interface: if you abide by this rule then you can usually assume that your code will be performant.
Before we discuss expression syntax, let’s take a look at how easily we can write non-performant code in NumPy. If we apply the same operation twice…
We mistakenly cross the domain boundary repeatedly and it impacts our performance
We abide correctly within NumPy’s domain
orig_xs = rng.uniform(size=(100_000, 10))
# 1. crosses the domain repeatedly due to Python iteration
xs = orig_xs.copy()
with timed('Python iteration'):
for i, row in enumerate(xs):
xs[i, :] = row * 100
# 2. no boundary crossing
xs = orig_xs.copy()
with timed('NumPy RCD'):
xs *= 100
Python iteration elapsed 0.320450s
NumPy RCD elapsed 0.001285s
Unless you are familiar with NumPy, I think the above code in section one is quite auspicious. Let’s contrast this to a Polars equivalent where first, we intentionally cross the RCD boundary, and second, we remain within the domain.
from polars import DataFrame as pl_DataFrame, all as pl_all
df = pl_DataFrame(orig_xs)
with timed('Python iteration'):
new_rows = []
for row in df.iter_rows():
new_rows.append([r * 100 for r in row])
result1 = pl_DataFrame(new_rows, schema=df.schema)
with timed('Polars RCD'):
result2 = df.select(pl_all() * 100)
result1.equals(result2)
Python iteration elapsed 0.468501s
Polars RCD elapsed 0.014243s
True
While this example is a touch exaggerated, it highlights an important point that I like about Polars: it is tedious for us to cross the domain boundary. I need to use numerous additional methods and mechanics just to iterate over the rows of the DataFrame and create a new one. By contrast, the expression syntax exists as a first-class citizen in Polars code.
But that’s not all there is to the Polars restricted computation domain. In fact, by providing an interface for users to construct queries to pass into the domain (instead of exposing an underlying direct array interface), Polars can optimize…
caching redundant expressions
reordering expressions to a more optimal solution
We can enable these optimizations simply by building a query (instead of immediately
evaluating it) via the LazyFrame
interface.
from polars import col
from IPython.display import display, Markdown
from re import sub
ldf = df.lazy()
query = (
df.lazy()
.filter(col('column_1') > .5)
.select(
x=col('column_2') * 100,
y=col('column_2') * 100 > 30,
)
)
# display(Markdown('## Unoptimized'))
# print(query.explain(optimized=False))
print(
'--Unoptimized--',
query.explain(optimized=False),
'\N{box drawings light horizontal}' * 80,
'--Optimized--',
sub(
r'SELECT \[(?P<cols>.+)\], (?P<CSE>CSE = \[.+\]) FROM\s+(?P<table>.+)',
r'SELECT \n [\1]\n [\2]\n FROM \3',
query.explain(optimized=True)),
sep='\n',
)
--Unoptimized--
SELECT [[(col("column_2")) * (100.0)].alias("x"), [([(col("column_2")) * (100.0)]) > (30.0)].alias("y")] FROM
FILTER [(col("column_1")) > (0.5)] FROM
DF ["column_0", "column_1", "column_2", "column_3"]; PROJECT */10 COLUMNS; SELECTION: None
────────────────────────────────────────────────────────────────────────────────
--Optimized--
SELECT
[col("__POLARS_CSER_0xb14fcd343b0e91fc").alias("x"), [(col("__POLARS_CSER_0xb14fcd343b0e91fc")) > (30.0)].alias("y")]
[CSE = [[(col("column_2")) * (100.0)].alias("__POLARS_CSER_0xb14fcd343b0e91fc")]]
FROM DF ["column_0", "column_1", "column_2", "column_3"]; PROJECT 2/10 COLUMNS; SELECTION: [(col("column_1")) > (0.5)]
The above query plans can be read similarly to SQL, but even if you’re unfamiliar with the syntax here, take a look at the differences between the two queries.
First, the unoptimized query contains a FILTER
operation that is simply not
present in the optimized query. This is because the filter operation is pushed down
to point to where it’s applied as the data is read in from the source (known as “predicate
pushdown”). This means we can perform multiple operations in a single pass on the
data (something that we cannot easily do in NumPy). Second, we can see that
there is a new column that is not present in the output: '__POLARS_CSER_...'
.
This refers to “Common Sub-Plan Elimination”, where Polars detects that we wanted
to perform the same operation twice ('column_2 * 100'
) and decides to cache
the result to that operation and reuse it in multiple places instead of
evaluating it multiple times.
These optimizations are something unique to Polars since its preferred API provides building blocks to pass complex computations into the domain, as opposed to the NumPy approach which exposes low-level data to the user.
Wrap-Up#
Whew! That’s a lot of discussion of restricted computation domains. While Polars is a shiny new tool, NumPy/pandas are still incredibly flexible workhorses. The benefits of this flexibility simply come at the price of knowing which operations will cross the domain boundary and which will not. Once you develop this intuition, you’ll be able to write more efficient code than ever before.
What do you think about these tools? Which would you use in your code? Something not making sense? Let me know on the DUTC Discord server.
Talk to you all next week!