Faster strftime#
Welcome back to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
On to the topic at hand. I wanted to tackle a fun pandas optimization problem, focusing on converting datetime objects to their date counterparts. I took this problem “head on,” meaning I did not ask why the end user wanted this output; I just benchmarked their existing approaches and threw in a couple of my own.
To get started, let’s make a dataset:
The Data#
We have a pandas.Series with a 'datetime64[ns]' datatype. The cardinality of the data was not specified, but the granularity was minute-wise. For the approaches below, I would expect different results depending on the granularity of the time series, as well as on the presence of duplicates.
from pandas import Series, date_range, Categorical, merge
s = Series(date_range('2000-01-01', periods=1_000_000, freq='5min'))
display(
    s.head(),
    f'{s.shape[0] = :,} | {s.dt.floor("D").nunique() = }'
)
0 2000-01-01 00:00:00
1 2000-01-01 00:05:00
2 2000-01-01 00:10:00
3 2000-01-01 00:15:00
4 2000-01-01 00:20:00
dtype: datetime64[ns]
's.shape[0] = 1,000,000 | s.dt.floor("D").nunique() = 3473'
The goal is to end with something like this output, but, of course, on all 1M datapoints instead of just the first five. So, we effectively want to convert each of these dates from their minute granularity down to their corresponding day.
s.head().dt.strftime('%Y%m%d')
0 20000101
1 20000101
2 20000101
3 20000101
4 20000101
dtype: object
The Timer#
To get started, I’m going to create a quick class to hold my timings so we can compare our micro-benchmarks later. The TimerManager class exposes a time method that can be used as a context manager to track each approach via a passed string description.
from dataclasses import dataclass, field
from contextlib import contextmanager
from time import perf_counter

@dataclass
class Timer:
    start: float = None
    end: float = None

    @property
    def elapsed(self):
        if self.start is None or self.end is None:
            raise ValueError('Timer must have both end and start')
        return self.end - self.start

@dataclass
class TimerManager:
    registry: list = field(default_factory=list)

    @contextmanager
    def time(self, description):
        timer = Timer(start=perf_counter())
        yield timer
        timer.end = perf_counter()
        self.registry.append((description, timer))
        print(f'{description:<30}{timer.end - timer.start:.6f}s')

timer, solutions = TimerManager(), []
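One property of the context manager above is worth knowing: if the timed block raises, the code after the yield never runs, so nothing is registered. Here is a minimal sketch of a variant that records the timing either way, assuming that behavior is wanted. The names timed and registry are hypothetical stand-ins, not part of the code above.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

registry = []  # hypothetical stand-in for TimerManager.registry

@contextmanager
def timed(description):
    start = perf_counter()
    try:
        yield
    finally:
        # Runs whether or not the body raised, so the timing is always recorded.
        registry.append((description, perf_counter() - start))

with timed('ok'):
    sleep(0.01)

try:
    with timed('fails'):
        raise RuntimeError('boom')
except RuntimeError:
    pass  # the exception still propagates out of the with-block

print([name for name, _ in registry])
```

For a one-off benchmark like this one the simpler version is fine; the try/finally form mostly matters when a failing approach should still show up in the results.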
pandas#
Let’s start with three approaches implemented in pure pandas.
1. Perhaps the most obvious is to simply use .dt.strftime with the desired output format.
2. We can deduplicate our pandas.Series, work across the unique entities, then reindex back to our original positions.
3. Again we deduplicate, but instead of reindexing we create a new Categorical via pandas.Categorical.from_codes.
The comments in the code indicate who originated that specific snippet. Comments ending in CR indicate that I originated this snippet.
with timer.time('pandas|.strftime'): # ① end-user
    solutions.append(
        s.dt.strftime('%Y%m%d')
    )

with timer.time('pandas|floor→dedupe→realign'): # ② CR
    tmp = s.dt.floor('D').rename('date')
    solutions.append(
        tmp
        .set_axis(tmp)
        .drop_duplicates().dt.strftime('%Y%m%d')
        .reindex(tmp)
        .set_axis(s.index)
    )

with timer.time('pandas|floor→factorize→strftime'): # ③ CR
    codes, cats = s.dt.floor('D').factorize()
    solutions.append(
        Categorical.from_codes(codes, categories=cats.strftime('%Y%m%d'))
    )
pandas|.strftime 4.822085s
pandas|floor→dedupe→realign 0.063474s
pandas|floor→factorize→strftime0.038231s
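To make approach ③ concrete, here is a toy sketch on five hand-picked timestamps (hypothetical data, not the benchmark series). It shows what factorize returns and why only the unique days need to be formatted. The variable names are my own.

```python
import pandas as pd

# Five timestamps spread over two calendar days (toy data for illustration).
ts = pd.Series(pd.to_datetime([
    '2000-01-01 00:00', '2000-01-01 00:05', '2000-01-02 12:00',
    '2000-01-01 23:55', '2000-01-02 00:00',
]))

# Approach 1: format every row.
direct = ts.dt.strftime('%Y%m%d')

# Approach 3: floor to the day, factorize, then format only the unique days.
codes, uniques = ts.dt.floor('D').factorize()
labels = uniques.strftime('%Y%m%d')  # one string per unique day
via_codes = pd.Categorical.from_codes(codes, categories=labels)

print(codes.tolist())   # [0, 0, 1, 0, 1]
print(list(labels))     # ['20000101', '20000102']
print(list(via_codes) == list(direct))  # True
```

The codes are integer positions into the unique days, in order of first appearance, so the expensive date→string step runs twice here instead of five times.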
Integer Math#
“numba makes everything fast, right?”
I’m never a fan of blanket statements, but the function that numba.vectorize is being applied to does seem like a great candidate for JIT compilation. The idea here is to end with an int64 that mimics the above string output. To push things in numba’s favor, I’ll even skip converting the end result to a string. For comparison against the engineered numba approach, I am going to use pandas to perform the same computation, multiplying each part of the date (e.g., the .year) by the power of ten that shifts it into the decimal position I want it to end up in.
4. numba, perform math on the parts of a date
5. pandas, perform math on the parts of a date
import numba
from numba import vectorize, int64

@numba.vectorize([int64(int64)])
def nanos_to_yyyymmdd_numba(nanos: int) -> int:
    """
    :WARNING: This is copy-pasted and pythonified from this C algorithm:
    <https://howardhinnant.github.io/date_algorithms.html#civil_from_days>.
    It has been only lightly checked over a range of ~20y for correct results.
    """
    z = (nanos // 86400000000000) + 719468  # whole days, epoch shifted to 0000-03-01
    era = (z if z >= 0 else z - 146096) // 146097
    doe = z - era * 146097  # [0, 146096]
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365  # [0, 399]
    y = yoe + era * 400
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)  # [0, 365]
    mp = (5 * doy + 2) // 153  # [0, 11]
    d = doy - (153 * mp + 2) // 5 + 1  # [1, 31]
    m = mp + 3 if mp < 10 else mp - 9
    y = y + 1 if m <= 2 else y
    return y * 10_000 + m * 100 + d
with timer.time('numba|parts math'): # ④ end-user
    array_nanos = s.dt.tz_localize(None).to_numpy(dtype='int64', copy=False)
    solutions.append(
        nanos_to_yyyymmdd_numba(array_nanos)
    )

with timer.time('pandas|parts math'): # ⑤ end-user
    solutions.append(
        (s.dt.year*10_000 + s.dt.month*100 + s.dt.day)
    )
numba|parts math 0.035491s
pandas|parts math 0.065108s
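The docstring above warns that the port was only lightly checked. As a sanity check of the arithmetic itself (not of the numba-compiled function), here is a pure-Python rendition compared against datetime.date for every day from 1678 through 2261, roughly the span a datetime64[ns] can represent. The function name days_to_yyyymmdd is hypothetical, and it takes whole days since 1970-01-01 rather than nanoseconds.

```python
from datetime import date

def days_to_yyyymmdd(days: int) -> int:
    """Pure-Python sketch of the civil_from_days arithmetic (days since 1970-01-01)."""
    z = days + 719468            # shift the epoch to 0000-03-01
    era = z // 146097            # Python's // already floors, so no sign adjustment here
    doe = z - era * 146097       # day of era, [0, 146096]
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153    # month counted from March, [0, 11]
    d = doy - (153 * mp + 2) // 5 + 1
    m = mp + 3 if mp < 10 else mp - 9
    y = yoe + era * 400 + (1 if m <= 2 else 0)
    return y * 10_000 + m * 100 + d

epoch = date(1970, 1, 1).toordinal()
mismatches = 0
for ordinal in range(date(1678, 1, 1).toordinal(), date(2262, 1, 1).toordinal()):
    day = date.fromordinal(ordinal)
    # The "parts math" reference: shift year and month by powers of ten.
    expected = day.year * 10_000 + day.month * 100 + day.day
    if days_to_yyyymmdd(ordinal - epoch) != expected:
        mismatches += 1

print(days_to_yyyymmdd(0))  # 19700101
print(mismatches)           # 0
```

A check like this is cheap to keep next to hand-rolled date arithmetic, which is part of why I would still lean on library code where it is fast enough.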
Polars#
Lastly, I thought, “Why not give Polars a shot?” It’s supposed to be blazingly fast, and my own anecdotal experience backs that claim up. So, let’s see how it stacks up against the other tools/approaches. I’m going to recreate two of the approaches I used in pandas for comparison.
6. Direct .strftime conversion
7. De-duplicate and self-join (pushes some work into a join instead of a date→string conversion)
from polars import from_pandas, col
pl_df = from_pandas(s.to_frame('ts')).lazy()

with timer.time('polars|strftime'): # ⑥ CR
    solutions.append(
        pl_df.select(col('ts').dt.strftime('%Y%m%d'))
        .collect()
    )
solutions[-1] = solutions[-1].to_pandas()['ts']

with timer.time('polars|date→dedupe→join'): # ⑦ CR
    linkage = pl_df.with_columns(date=col('ts').dt.date())
    solutions.append(
        linkage
        .unique('date')
        .with_columns(date_str=col('date').dt.strftime('%Y%m%d'))
        .join(linkage, on='date', how='inner')
        .collect()
    )
solutions[-1] = solutions[-1].to_pandas()['date_str']
polars|strftime 0.193334s
polars|date→dedupe→join 0.024993s
Checking Accuracy#
To verify that all approaches generated the same output, I compared each adjacent pair of results after casting them to integers. If any two results differ, the assertion raises an error.
from itertools import pairwise
print(f'{len(solutions) = }')
for s1, s2 in pairwise(solutions):
    assert (s1.astype(int) == s2.astype(int)).all()
len(solutions) = 7
from pandas import DataFrame
df = (
    DataFrame(timer.registry, columns=['description', 'timer'])
    .assign(
        elapsed_ms=lambda d: 1_000 * d['timer'].map(lambda t: t.elapsed),
        # raw string avoids the invalid '\|' escape; expand=False returns a Series
        package=lambda d: d['description'].str.extract(r'(.*)\|', expand=False),
    )
    .drop(columns='timer')
    .sort_values('elapsed_ms', ascending=False)
)
df
| | description | elapsed_ms | package |
|---|---|---|---|
| 0 | pandas\|.strftime | 4822.085410 | pandas |
| 5 | polars\|strftime | 193.334366 | polars |
| 4 | pandas\|parts math | 65.107530 | pandas |
| 1 | pandas\|floor→dedupe→realign | 63.474179 | pandas |
| 2 | pandas\|floor→factorize→strftime | 38.231458 | pandas |
| 3 | numba\|parts math | 35.491455 | numba |
| 6 | polars\|date→dedupe→join | 24.992786 | polars |
Visualize The Micro-benchmarks#
%matplotlib inline
from matplotlib.pyplot import rc, setp
from flexitext import flexitext
import polars, pandas

rc('figure', figsize=(10, 6), facecolor='white')
rc('font', size=12)
rc('axes.spines', top=False, right=False, left=False)

palette = {
    'polars': '#1F77B4FF',
    'pandas': '#FF7F0EFF',
    'numba': '#2CA02C',
}

ax = df.plot.barh(
    x='description', y='elapsed_ms', legend=False, width=.8,
    color=df['package'].map(palette),
)
ax.set_ylabel('')
ax.yaxis.set_tick_params(length=0)
ax.bar_label(ax.containers[0], fmt='{:.0f}ms', padding=5)
ax.set_xlabel(r'Duration (ms)')
ax.margins(y=0)

new_labels = []
for lab in ax.get_yticklabels():
    package, _, approach = lab.get_text().partition('|')
    lab.set(
        text=f'{package}{approach:>25}',
        color=palette[package],
        size='large',
        fontfamily='monospace',
    )
    new_labels.append(lab)
ax.set_yticklabels(new_labels)
ax.figure.tight_layout()

left_x = min(text.get_tightbbox().x0 for text in ax.get_yticklabels())
x,_ = ax.transAxes.inverted().transform([left_x, 0])
annot = flexitext(
    s='<size:x-large,weight:semibold>'
      'Time Elapsed for 1 million datetime→date string conversion\n'
      '</>'
      '<size:large>'
      f' <color:{palette["pandas"]}>pandas {pandas.__version__}</>'
      f' vs <color:{palette["numba"]}>numba {numba.__version__}</>'
      f' vs <color:{palette["polars"]}>Polars {polars.__version__}</>'
      '</>',
    x=x, y=1.02, va='bottom', ha='left',
);
Wrap-Up#
Seems like all three tools (numba, Polars, and pandas) can tackle this problem with very similar timings for the fastest datetime→date string(ish) conversion. In this case, our manual strategies (deduplication and realignment) created the largest improvement in time, rather than the specific tool that we used. I personally much prefer the pandas and Polars solutions over the numba one: I am hesitant about untested numba code, and the former will be much easier to maintain in the future. Let the library authors write library code.
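The deduplication idea does not depend on any of these libraries. As a minimal stdlib-only sketch (toy data, hypothetical variable names), the same trick is one strftime call per unique day followed by a dictionary lookup per row:

```python
from datetime import datetime, timedelta

start = datetime(2000, 1, 1)
# 10 days of 5-minute data: 288 rows per day, 2,880 rows in total.
stamps = [start + timedelta(minutes=5 * i) for i in range(2_880)]

# Naive: one strftime call per row.
naive = [ts.strftime('%Y%m%d') for ts in stamps]

# Deduplicated: one strftime call per unique calendar day, then a lookup per row.
days = [ts.date() for ts in stamps]
lookup = {day: day.strftime('%Y%m%d') for day in set(days)}
deduped = [lookup[day] for day in days]

print(len(naive), len(lookup), naive == deduped)  # 2880 10 True
```

The fewer unique days there are relative to rows, the more formatting work this skips, which is why I would expect the benchmark results to shift with the granularity of the data.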
That’s it for this week! When approaching any type of optimization problem, make sure that you question the algorithm first rather than reach for unfamiliar tools just because they’re marketed as “faster” alternatives.
What did you think about my approach? Let me know on our Discord. Talk to you all next week!