# DataFrame Value Membership Testing
This week, I received a great question on our Discord Server about finding strings within a list in a pandas.Series.
But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.
Ready to reshape your Python journey? Join us for two days of immersive learning, where expertise meets real-world application!
Now, back to the question:
In pandas, if we have a column of strings, we can check whether each of the elements is in a given set, thanks to the isin method. What if the column contains a list of strings instead? How could we check whether this list is a subset of a given reference list? Should we use explode()…isin() and then aggregate again, or are there better ways to do it?
In each of the following examples, I will create a DataFrame with a column of 'words'. In some examples, the 'words' column will simply contain a string in each cell of the Series. Then, we’ll move to the more complex problem of lists of strings nested in a Series.
## Simple String Finding
The simplest version of this problem removes the nested “string within a list” component. There are a few ways to see if a string matches multiple targets in pandas. We can either…

- perform an .isin check
- concatenate the targets into a regular expression and use .str.match
from pandas import DataFrame as pd_DataFrame, Series
words_to_find = ['phenomenal', 'drowse']
df = pd_DataFrame({'words': [
'stria', 'phenomenal', 'lungs', 'drowse', 'exhorting'
]})
df.assign(
multi_check=lambda d: d['words'].isin(words_to_find),
multi_check_re=lambda d: d['words'].str.match(r'|'.join(words_to_find)),
).style.map( # formatting for display purposes
lambda v: 'background-color: green' if v is True else ''
)
| | words | multi_check | multi_check_re |
|---|---|---|---|
| 0 | stria | False | False |
| 1 | phenomenal | True | True |
| 2 | lungs | False | False |
| 3 | drowse | True | True |
| 4 | exhorting | False | False |
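One caveat on the regex-based approach: .str.match only anchors at the start of each string, so a target like 'drowse' would also flag 'drowsed', and any regex metacharacters in the targets would be interpreted as regex syntax rather than literal text. Here is a minimal sketch (not part of the original comparison) that guards against both, assuming exact whole-string matches are what we want:

from re import escape

# escape each target so metacharacters are matched literally,
#   and use .str.fullmatch to anchor both ends of the string
pattern = r'|'.join(escape(w) for w in words_to_find)
df['words'].str.fullmatch(pattern)

For the single-word targets above, this returns the same Boolean Series as the .isin check.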
Both approaches look fairly straightforward. But, what if our DataFrame contains lists of values instead?
## Nested String Finding - Pandas

In this problem, our 'words' Series contains lists of strings. Since this is backed by the 'object' dtype, we should expect its performance to be on par with base Python. Additionally, I was curious to see if I could do a few different things:

- Indicate whether or not one of the words in the DataFrame matches a word in words_to_find
- Count the number of matches between each row in the DataFrame and the words in words_to_find
- Filter the words in each row of the DataFrame down to only those that match

To do each of these, I took two primary approaches:

- .explode the nested data, then reduce back down to the original index
- .apply with an object dtype, which doesn’t require doing much else
from pandas import DataFrame
words_to_find = ['phenomenal', 'lungs', 'drowse', 'snack']
df = DataFrame({
'words': [
['stria', 'phenomenal', 'lungs'],
['drowse', 'exhorting', 'ark'],
['bespeaking', 'aggregates', 'confute'],
['snack', 'pleasingly', 'trench'],
['loather', 'overmodest', 'wardress']
]
})
df.assign(
explode_overlap=lambda d:
d['words'].explode().isin(words_to_find).groupby(level=0).any(),
explode_counting=lambda d:
d['words'].explode().isin(words_to_find).groupby(level=0).sum(),
explode_intersection=lambda d:
d['words'].explode()
.loc[lambda d: d.isin(words_to_find)]
.groupby(level=0).apply(list),
apply_overlap=lambda d:
d['words'].apply(lambda v: any(w in v for w in words_to_find)),
apply_counting=lambda d:
d['words'].apply(lambda v: sum(w in v for w in words_to_find)),
apply_intersection=lambda d:
d['words'].apply(lambda v: [w for w in words_to_find if w in v]),
)
| | words | explode_overlap | explode_counting | explode_intersection | apply_overlap | apply_counting | apply_intersection |
|---|---|---|---|---|---|---|---|
| 0 | [stria, phenomenal, lungs] | True | 2 | [phenomenal, lungs] | True | 2 | [phenomenal, lungs] |
| 1 | [drowse, exhorting, ark] | True | 1 | [drowse] | True | 1 | [drowse] |
| 2 | [bespeaking, aggregates, confute] | False | 0 | NaN | False | 0 | [] |
| 3 | [snack, pleasingly, trench] | True | 1 | [snack] | True | 1 | [snack] |
| 4 | [loather, overmodest, wardress] | False | 0 | NaN | False | 0 | [] |
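Note one subtle difference in the table above: explode_intersection shows NaN where no words matched (rows 2 and 4), because those index values disappear entirely after the .loc filter, while apply_intersection returns an empty list. If we wanted the two columns to agree, one option (a sketch, assuming empty lists are the preferred representation) is to re-align on the original index and replace the resulting NaNs:

explode_intersection = (
    df['words'].explode()
    .loc[lambda s: s.isin(words_to_find)]
    .groupby(level=0).apply(list)
    .reindex(df.index)  # restore rows that were filtered out (as NaN)
    .apply(lambda v: v if isinstance(v, list) else [])  # NaN -> []
)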
## Nested String Finding - Polars

Polars provides access to a native List datatype, which should meaningfully speed up the search we are trying to do. In this section, I will take the same approaches as above and leverage the relevant Polars expressions. The .explode approach ends up being much more verbose than the other two, but we can see the convenience that the List datatype offers: we can use list.set_intersection to find which words in each row of the DataFrame intersect with our words_to_find, and we can apply element-wise operations via .list.eval, which should be much faster than our naive .apply approach in pandas.
from polars import from_pandas, col, element as pl_element
(
from_pandas(df)
.with_row_index()
.explode('words')
    .group_by('index', maintain_order=True).agg(  # keep the original row order
words=col('words'),
explode_intersection=col('words').filter(col('words').is_in(words_to_find)),
)
.with_columns(
set_intersection=col('words').list.set_intersection(words_to_find),
element_overlap=(
col('words').list.eval(pl_element().is_in(words_to_find)).list.any()
),
)
)
| index | words | explode_intersection | set_intersection | element_overlap |
|---|---|---|---|---|
| u32 | list[str] | list[str] | list[str] | bool |
| 0 | ["stria", "phenomenal", "lungs"] | ["phenomenal", "lungs"] | ["phenomenal", "lungs"] | true |
| 1 | ["drowse", "exhorting", "ark"] | ["drowse"] | ["drowse"] | true |
| 2 | ["bespeaking", "aggregates", "confute"] | [] | [] | false |
| 3 | ["snack", "pleasingly", "trench"] | ["snack"] | ["snack"] | true |
| 4 | ["loather", "overmodest", "wardress"] | [] | [] | false |
## Performance Testing

Now that we’ve discussed some approaches to this problem, let’s test them out on a larger dataset and compare the runtime of each. In the setup below, I create a list of 1,000 words to search for. Then, I create 10,000 lists of 50 strings each and store those data in both Polars and pandas DataFrames. For the performance test, I will only evaluate this Boolean condition: “is a word from words_to_find in the list of words for this row of the DataFrame?”
from numpy.random import default_rng
with open('./data/words') as f:
word_pool = [
word for l in f
if (word := l.strip()).islower()
and len(word) > 3
and word.isascii()
and word.isalpha()
]
rng = default_rng(0)
words_to_find = list(rng.choice(word_pool, size=1_000))
pd_df = DataFrame(
{'words': rng.choice(word_pool, size=(10_000, 50)).tolist()}
)
pl_df = from_pandas(pd_df).lazy()
display(
words_to_find[:5],
pd_df.head(),
pl_df.head(5).collect(),
)
['stressful', 'philatelist', 'lungfuls', 'drollest', 'exfoliates']
| | words |
|---|---|
| 0 | [mooting, bisexually, conveyable, tadpole, ind... |
| 1 | [devouring, sandhogs, fickleness, pandemonium,... |
| 2 | [whispers, infuriates, moratorium, hermaphrodi... |
| 3 | [sinkholes, electrics, incarnate, unappropriat... |
| 4 | [pasturage, acct, genuflecting, regurgitates, ... |

| words |
|---|
| list[str] |
| ["mooting", "bisexually", … "implementations"] |
| ["devouring", "sandhogs", … "toccata"] |
| ["whispers", "infuriates", … "defects"] |
| ["sinkholes", "electrics", … "downloading"] |
| ["pasturage", "acct", … "rocker"] |
Let’s set up a timing context manager for some quick tests:
from time import perf_counter, sleep
from contextlib import contextmanager
@contextmanager
def timer():
start = perf_counter()
yield (t := [None])
stop = perf_counter()
t[0] = stop-start
with timer() as t:
sleep(1)
print(f'the block of code above took {t[0]:.6f}s')
the block of code above took 1.001037s
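The context manager above measures a single run, which is convenient but can be noisy. For more stable numbers, one could instead reach for timeit.repeat and keep the best of several runs — a hypothetical variant, not what the results below were measured with:

from timeit import repeat

# run the callable 5 times and keep the fastest, which minimizes
#   the influence of background noise on the measurement
best = min(repeat(lambda: sleep(1), number=1, repeat=5))
print(f'best of 5 runs: {best:.6f}s')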
Here are all of the tests we are running:

- pandas using explode → reduce
- pandas using .apply
- Polars using .list.set_intersection
- Polars using explode → reduce
- Polars using list.eval for elementwise operations
performance_tests = {
('pandas', '.explode…groupby'): lambda:
pd_df['words']
.explode()
.isin(words_to_find)
.groupby(level=0).any(),
('pandas', '.apply'): lambda:
pd_df['words']
.apply(lambda v: any(w in v for w in words_to_find)),
('polars', '.explode…groupby'): lambda:
pl_df.with_row_index()
.explode('words')
        .group_by('index', maintain_order=True)  # stable order for the equality check
.agg(col('words').is_in(words_to_find).any())
.select(col('words'))
.collect(),
    ('polars', '.set_intersection'): lambda:
pl_df.select(
col('words').list.set_intersection(list(words_to_find))
.list.len() >= 1
).collect(),
('polars', '.element'): lambda:
pl_df.select(
col('words').list.eval(pl_element().is_in(words_to_find)).list.any()
).collect(),
}
With our tests set up, let’s run each of them to see how long each approach takes. We’ll also be careful to store the result of the operation to ensure each test generates the same set of results.
timing_results, numeric_results = [], []
for (backend, desc), test in performance_tests.items():
with timer() as t:
res = test()
if not isinstance(res, Series): # coerce all results to comparable structures
res = res.to_series().to_pandas()
numeric_results.append(res)
timing_results.append(
{'backend': backend, 'description': desc, 'time delta': t[0]}
)
# ensure all tests generated the same set of results
from itertools import pairwise
for left, right in pairwise(numeric_results):
assert left.equals(right)
Looks like everything ran to completion, so let’s see how long each approach took.
(
pd_DataFrame(timing_results)
.set_index(['backend', 'description'])
.sort_values(['time delta'])
).style.background_gradient(cmap='Greens', subset='time delta')
| backend | description | time delta |
|---|---|---|
| polars | .element | 0.008686 |
| polars | .set_intersection | 0.013529 |
| polars | .explode…groupby | 0.015430 |
| pandas | .explode…groupby | 0.044180 |
| pandas | .apply | 4.422752 |
Clearly, Polars has an advantage here. For tests of this size (1,000 targets searched across 10,000 lists, each containing 50 strings), Polars holds a decent lead regardless of the approach taken. pandas is a bit behind in terms of performance on this operation, but I wouldn’t say that is the interesting takeaway. What’s interesting is that:

- All of the Polars approaches are relatively similar in their speed
- pandas has a LARGE difference between its fastest and slowest approaches

This supports my observation that naive code in Polars will still run fast, whereas naive code in pandas can have a detrimental performance impact.
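Worth noting: much of that pandas .apply penalty comes from the lambda itself, which scans each row’s 50-element list once for each of the 1,000 targets. A hash-based membership check narrows the gap considerably — a sketch of a hypothetical variant that was not part of the benchmark above:

# precompute a set so each row costs one hash-based disjointness
#   check instead of up to 1,000 linear scans of the list
targets = set(words_to_find)
pd_df['words'].apply(lambda v: not targets.isdisjoint(v))

Even so, the broader point stands: the Polars expressions are fast without needing this kind of hand-tuning.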
That’s all I have for this week. Until next time!