Profiling pandas Filters

Hello, everyone! For Cameron's Corner this week, I wanted to spend some time differentiating between various filtering operations in pandas. Specifically, I wanted to test out operations on a DatetimeIndex for working with slices of datetime values.

from pandas import date_range, Series
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=date_range('2000-01-01', periods=(size:=1_000_000), freq='15s'),
    data=rng.normal(size=size)
)

s

2000-01-01 00:00:00    0.125730
2000-01-01 00:00:15   -0.132105
2000-01-01 00:00:30    0.640423
2000-01-01 00:00:45    0.104900
2000-01-01 00:01:00   -0.535669
                         ...   
2000-06-22 14:38:45    0.852949
2000-06-22 14:39:00   -1.153803
2000-06-22 14:39:15    0.300560
2000-06-22 14:39:30   -0.174469
2000-06-22 14:39:45    0.228642
Freq: 15s, Length: 1000000, dtype: float64

from itertools import pairwise
from functools import partial

from pandas import Timestamp

def boolean_indexing(s, start, stop):
    return s.loc[(Timestamp(start) <= s.index) & (s.index < Timestamp(stop))]
    
def simple_loc(s, span):
    return s.loc[span]

def manual_loc(s, start, stop):
    """manually slice pandas Series. ONLY works on sorted data.
    """
    begin, end = s.index.searchsorted(start), s.index.searchsorted(stop)
    return s.iloc[begin:end]

start, stop = '2000-02-03', '2000-02-04'
approaches = [
    partial(boolean_indexing, s, start, stop),
    partial(simple_loc, s, start),
    partial(manual_loc, s, start, stop)
]

# assert all approaches result in the same results
for app1, app2 in pairwise(approaches):
    assert app1().equals(app2()), (app1.func, app2.func)

Timings

Let's do some quick timings for each of these approaches. I've ordered them by what my intuition tells me will be slowest to fastest:

%timeit -n 5 -r 10 boolean_indexing(s, start, stop)
%timeit -n 5 -r 10 simple_loc(s, start)
%timeit -n 5 -r 10 manual_loc(s, start, stop)

8.79 ms ± 1.58 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)
525 μs ± 94.8 μs per loop (mean ± std. dev. of 10 runs, 5 loops each)
99 μs ± 20 μs per loop (mean ± std. dev. of 10 runs, 5 loops each)

Looks like I was correct! Using .loc lets pandas take advantage of the sorted index to gain a huge speed boost (boolean vs simple_loc). However, we can make this filtering operation even faster by manually slicing between these two (manual_loc vs simple_loc).

It is important to note that many index-aware operations will only hit fastpaths IF we have a sorted index and/or unique index. Our manual_loc function does not check to see if this is the case and therefore will return incorrect results (which is why it is excluded below).

shuffled_s = s.sample(frac=1, random_state=0)

%timeit -n 5 -r 10 boolean_indexing(shuffled_s, start, stop)
%timeit -n 5 -r 10 simple_loc(shuffled_s, start)

9.7 ms ± 1.74 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)

The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
7.87 ms ± 5.27 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)

Let's see if we can better understand the differences in codepaths taken between the latter two approaches (.loc vs some manual tweaking of .iloc).

Profiling

We can easily profile in IPython via the %prun magic. However, in this case, I want to actually extract the executed functions for a follow up analysis. I'll define a few helper functions to structure my code.

from itertools import islice
import cProfile, pstats, io
from pstats import SortKey
from contextlib import contextmanager
from pandas import DataFrame

@contextmanager
def profile_code(buffer=None):
    if buffer is None:
        buffer = io.StringIO()
    pr = cProfile.Profile()
    pr.enable()
    yield buffer
    pr.disable()
    
    sortby = SortKey.CUMULATIVE
    ps = pstats.Stats(pr, stream=buffer).strip_dirs().sort_stats(sortby)
    ps.print_stats()
    
def to_dataframe(buffer):
    buffer.seek(0)
    stream = (
        ln.strip().split(maxsplit=5) for ln in islice(buffer, 4, None)
        if ln.strip()
    )
    
    dtypes = [str, float, float, float, float, str]
    head = next(stream)
    body = (
        [typ(val) for typ, val in zip(dtypes, row)]
        for row in stream
    )
    
    return DataFrame(body, columns=head)

Now we can comfortably profile segments of code and extract their results into a pandas.DataFrame.

with profile_code() as loc_buffer:
    s.loc['2000-03-02']

with profile_code() as iloc_buffer:
    begin = s.index.searchsorted('2000-03-02')
    end   = s.index.searchsorted('2000-03-03')
    s.iloc[begin:end]

loc_df  = to_dataframe(loc_buffer)
iloc_df = to_dataframe(iloc_buffer)

display(
    loc_df.head(),
    iloc_df.head(),
)

	ncalls	cumtime	percall	filename:lineno(function)
0	2/1	0.001	0.001	indexing.py:1176(__getitem__)
1	1	0.001	0.001	indexing.py:1397(_getitem_axis)
2	1	0.001	0.001	indexing.py:1379(_get_label)
3	1	0.001	0.001	generic.py:4159(xs)
4	1	0.001	0.001	datetimes.py:582(get_loc)

	ncalls	tottime	percall	cumtime	percall	filename:lineno(function)
0	4	0.000	0.000	0.007	0.002	base_events.py:1908(_run_once)
1	3/2	0.001	0.000	0.006	0.003	base.py:1333(searchsorted)
2	2	0.000	0.000	0.005	0.003	_mixins.py:240(searchsorted)
3	2	0.005	0.003	0.005	0.003	{method 'searchsorted' of 'numpy.ndarray' obje...
4	5	0.000	0.000	0.001	0.000	events.py:86(_run)

Comparison

Let's do a very quick comparison across the function calls that we made. The final column of the produced DataFrames give us the unique functions invoked during our profiling. From here, we can see how many function calls are either shared or unique to each approach.

col = 'filename:lineno(function)'
loc_funcs, iloc_funcs = set(loc_df[col]), set(iloc_df[col])

print(
    f'.loc[datestring]         → called {len(loc_funcs): <3} unique functions',
    f'.iloc[searchsorted(…):…] → called {len(iloc_funcs): <3} unique functions',
    sep='\n'
)

.loc[datestring]         → called 128 unique functions
.iloc[searchsorted(…):…] → called 175 unique functions

Right away, we can see that our manual .iloc approach calls fewer unique Python functions than the .loc approach. This provides an indication as to why one approach may be faster than another.

Let's go one step further and see how many function calls these approaches share traits with, and deviate from, each other.

print(
    f'Both approaches call {len(loc_funcs & iloc_funcs)} of the same functions',
    f'The .loc approach calls {len(loc_funcs - iloc_funcs)} functions NOT in the manual approach',
    f'The .iloc approach calls {len(iloc_funcs - loc_funcs)} functions NOT in the simple approach',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)

Both approaches call 67 of the same functions
────────────────────────────────────────
The .loc approach calls 61 functions NOT in the manual approach
────────────────────────────────────────
The .iloc approach calls 108 functions NOT in the simple approach

Wrap Up

In this quick analysis, it was quite surprising the volume of overhead that pandas invokes for a simple .loc operation. Just by performing a manual search and slice, we sped up our filter operation by ~5x and made half the total number of function calls.

The differences shown here are not common culprits of slow pandas code (remember that using .loc with a datetime string was ~20x faster than our boolean masking approach). However, now you can confidently go about exploring fastpaths in pandas with an understanding as to why they might be faster.

What do you think about my approach? Let us know on the DUTC Discord server.

Talk to you all next time!

Profiling pandas Filters

Timings

Profiling

Comparison

Wrap Up

Table of Contents