Timing DataFrame Filters#

Hello, everyone! I wanted to follow up on last week’s blog post, Profiling pandas Filters, and test how Polars stands up in its simple filtering operations.

An important note: these timings are NOT intended to be exhaustive and should not be used to determine if one tool is “better” than another.

Let’s create some dummy data for us to use to investigate:

import pandas; import polars

print(
    f'{pandas.__version__ = }',
    f'{polars.__version__ = }',
    sep='\n',
)
pandas.__version__ = '1.5.3'
polars.__version__ = '0.20.7'
from pandas import date_range, DataFrame
from polars import from_pandas, col
from numpy.random import default_rng

rng = default_rng(0)


pd_df = DataFrame({                           # ⓐ
    'date': date_range(
        '2000-01-01', periods=(size:=1_000_000), freq='15s'
    ),
    'value': rng.normal(size=size),
})
pd_df_idx = pd_df.set_index('date')           # ⓑ
pl_df = from_pandas(pd_df).set_sorted('date') # ⓒ
pl_lazy = pl_df.lazy()                        # ⓓ


pl_df.head(5)
shape: (5, 2)
datevalue
datetime[ns]f64
2000-01-01 00:00:000.12573
2000-01-01 00:00:15-0.132105
2000-01-01 00:00:300.640423
2000-01-01 00:00:450.1049
2000-01-01 00:01:00-0.535669

Timing Setup#

We will only time each operation once. For the most part, this won’t be a huge issue. However, some testing has revealed that pandas is performing some caching, flag setting, or consolidation (unlikely) after its first .loc operation.

This finding warrants further investigation. The objects used to track timings are…

from dataclasses import dataclass, field
from contextlib import contextmanager
from time import perf_counter

@dataclass
class Timer:
    start: float = None
    end: float = None
    
    @property
    def elapsed(self):
        if self.start is None or self.end is None:
            raise ValueError('Timer must have both end and start')
        return self.end - self.start

@dataclass
class TimerManager:
    registry: list = field(default_factory=list)
    
    @contextmanager
    def time(self, description):
        timer =  Timer(start=perf_counter())
        yield timer
        timer.end = perf_counter()
        self.registry.append((description, timer))

pandas Filters#

For pandas, I wanted to test four approaches:

  1. .loc with a combined boolean mask

  2. .loc using .between (in case any fast paths exist in this approach)

  3. .loc with a sorted index, simply supplying in the start+range of values.

  4. .loc same as the third approach but with a SIGNIFICANT speed difference in the second .loc

from pandas import Timestamp
pd_start = Timestamp(2000, 4, 1)
pd_stop = Timestamp(2000, 4, 2)

timer = TimerManager()

# 1.
with timer.time('pd.loc[start <= date < end]'):
    pd_df.loc[
        (pd_start <= pd_df['date']) & (pd_df['date'] < pd_stop)
    ]
    
# 2.
with timer.time('pd.loc[date.between(…)]'):
    pd_df.loc[
        pd_df['date'].between(pd_start, pd_stop, inclusive='left')
    ]
    
# 3.
with timer.time('pd.loc[date] {cold}'):
    pd_df_idx.loc['2000-06-01']
    
# 4. 
# this performs MUCH faster than the above, it is a direct output
# caching because there is still a speed difference even with different inputs
with timer.time('pd.loc[date] {warm}'):
    pd_df_idx.loc['2000-04-01']
    
for desc, t in timer.registry:
    if desc.startswith('pd'):
        print(f'{desc: >35}{t.elapsed*1000:.6f}µs')
        pd.loc[start <= date < end] → 21.833562µs
            pd.loc[date.between(…)] → 12.665219µs
                pd.loc[date] {cold} → 55.443002µs
                pd.loc[date] {warm} → 0.503281µs

Polars Filters#

For Polars, I wanted to test some approaches on both an eager polars.DataFrame and a polars.LazyFrame. Aside from the slight differences between the two, the major approaches I wanted to handling the date are…

  1. filter where the start <= date < stop

  2. filter where date.dt.date() == start.date()

  3. slicing, using a binary search for our dates to find slicing indices

from polars import datetime
pl_start = datetime(2000, 4, 1)
pl_stop = datetime(2000, 4, 2)

with timer.time('pl.filter(start <= date < stop)'):
    pl_df.filter(
        (pl_start <= col('date')) & (col('date') < pl_stop)
    )
    
with timer.time('pl.lazy.filter(start <= date < stop)'):
    pl_lazy.filter(
        (pl_start <= col('date')) & (col('date') < pl_stop)
    ).collect()
    
with timer.time('pl.filter(date.date == start)'):
    pl_df.filter(
        (col('date').dt.date() == pl_start.dt.date())
    )
    
with timer.time('pl.lazy.filter(date.date == start)'):
    pl_lazy.filter(
        (col('date').dt.date() == pl_start.dt.date())
    ).collect()
    
with timer.time('pl[idx_start:idx_stop]'):
    pl_df[
        slice(
            *pl_df.select(
                idx_start=col('date').search_sorted(pl_start),
                idx_stop =col('date').search_sorted(pl_stop)
            ).row(0)
        )
    ]
    
with timer.time('pl[lazy.idx_start:lazy.idx_stop]'):
    pl_df[
        slice(
            *pl_lazy.select(
                idx_start=col('date').search_sorted(pl_start),
                idx_stop=col('date').search_sorted(pl_stop)
            )
            .collect()
            .row(0)
        )
    ]
    
for desc, t in timer.registry:
    if desc.startswith('pl'):
        print(f'{desc: >35}{t.elapsed*1000:.6f}µs')
    pl.filter(start <= date < stop) → 13.505592µs
pl.lazy.filter(start <= date < stop) → 12.125287µs
      pl.filter(date.date == start) → 14.157752µs
 pl.lazy.filter(date.date == start) → 14.510133µs
             pl[idx_start:idx_stop] → 4.142557µs
   pl[lazy.idx_start:lazy.idx_stop] → 5.277686µs

Results#

Last—but certainly not least—I wanted to compile all of the timing results and visualize them. Again, these timings are NOT intended to be exhaustive and should not be used to determine if one tools is “better” than another.

df = (
    DataFrame(timer.registry, columns=['description', 'timer'])
    .assign(
        elapsed_us=lambda d: 1_000 * d['timer'].map(lambda t: t.elapsed),
        package=lambda d: d['description'].str.extract('^(pl|pd)'),
    )
    .drop(columns='timer')
    .sort_values('elapsed_us', ascending=False)
)

df
description elapsed_us package
2 pd.loc[date] {cold} 55.443002 pd
0 pd.loc[start <= date < end] 21.833562 pd
7 pl.lazy.filter(date.date == start) 14.510133 pl
6 pl.filter(date.date == start) 14.157752 pl
4 pl.filter(start <= date < stop) 13.505592 pl
1 pd.loc[date.between(…)] 12.665219 pd
5 pl.lazy.filter(start <= date < stop) 12.125287 pl
9 pl[lazy.idx_start:lazy.idx_stop] 5.277686 pl
8 pl[idx_start:idx_stop] 4.142557 pl
3 pd.loc[date] {warm} 0.503281 pd

Quite the mix of timings! In this case, it appears that there is not too much of a difference between polars.DataFrame and polars.LazyFrame (these computations aren’t incredibly parallelizable, so this is not surprising). We do see a slight increase in the speed of Polars operations that have multiple expressions (see the difference in the Polars Lazy vs eager filter for start <= date <= stop).

Aside from that, the most surprising result is the difference between the cold and warm .loc operations in pandas. This is something I’ll explore more in the future, perhaps taking some approaches we saw in Profiling pandas Filters.

%matplotlib inline
from matplotlib.pyplot import rc, setp
from flexitext import flexitext

rc('figure', figsize=(10, 6), facecolor='white')
rc('font', size=12)
rc('axes.spines', top=False, right=False, left=False)


palette = {
    'pl': '#1F77B4FF',
    'pd': '#FF7F0EFF',
}

ax = df.plot.barh(
    x='description', y='elapsed_us', legend=False, width=.8,
    color=df['package'].map(palette),
)
ax.set_ylabel('')
ax.yaxis.set_tick_params(length=0)
ax.bar_label(ax.containers[0], fmt='{:.2f}', padding=5)
ax.set_xlabel(r'Duration (µs)')
setp(ax.get_yticklabels(), va='center')

ax.figure.tight_layout()
left_x = min(text.get_tightbbox().x0 for text in ax.get_yticklabels())
x,_ = ax.transAxes.inverted().transform([left_x, 0])

annot = flexitext(
    s='<size:xx-large,weight:semibold>Time Elapsed for Datetime Filtering in'
     f' <color:{palette["pl"]}>Polars <size:medium>{polars.__version__}</></>'
     f' vs <color:{palette["pd"]}>pandas <size:medium>{pandas.__version__}</></>'
      ' </>',
    x=x, y=1.02, va='bottom', ha='left',
);
../_images/d91a681c17bf1a04bb6874dab847cb766a6f973b1b1b54ff9f672b9eb785cd14.png

Wrap-Up#

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

That’s all for this week! Stay tuned for another dive into DataFrames next week!

Talk to you then!