DataFrame Value Membership Testing#

This week, I received a great question on our Discord Server about finding strings within a list in a pandas.Series.

But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.

Ready to reshape your Python journey? Join us for two days of immersive learning, where expertise meets real-world application!

Now, back to the question:

In pandas, if we have a column of strings, we can check whether each of the elements is in a given set, thanks to the isin method. What if the column contains a list of strings instead? How could we check whether this list is a subset of a given reference list? Should we use explode()…isin() and then aggregate again, or are there better ways to do it?

In each of the following examples, I will create a DataFrame with a column of 'words'. In some examples, the 'words' column will simply contain a string in each cell of the Series. Then, we’ll move to the more complex problem of lists of strings nested within a Series.

Simple String Finding#

The simplest version of this problem removes the nested “string within a list” component. There are a few ways to see if a string matches multiple targets in pandas. We can either…

  • perform an .isin check

  • concatenate the targets into a regular expression and use .str.match

from pandas import DataFrame as pd_DataFrame, Series

words_to_find = ['phenomenal', 'drowse']
df = pd_DataFrame({'words': [
    'stria', 'phenomenal', 'lungs', 'drowse', 'exhorting'
]})

df.assign(
    multi_check=lambda d: d['words'].isin(words_to_find),
    multi_check_re=lambda d: d['words'].str.match(r'|'.join(words_to_find)),
).style.applymap( # formatting for display purposes
    lambda v: 'background-color: green' if v is True else ''
)
        words  multi_check  multi_check_re
0       stria        False           False
1  phenomenal         True            True
2       lungs        False           False
3      drowse         True            True
4   exhorting        False           False
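
One caveat on the regular expression approach: .str.match only anchors at the start of each string, so a target like 'drowse' would also match 'drowsey'. If exact matches are needed, a minimal sketch using .str.fullmatch (with re.escape to guard against any regex metacharacters in the targets; the column name multi_check_full is mine) would be:

from re import escape

df.assign(
    # fullmatch requires the entire string to match the alternation pattern
    multi_check_full=lambda d:
        d['words'].str.fullmatch('|'.join(escape(w) for w in words_to_find)),
)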

Either way, both approaches are fairly straightforward. But what if each row of our 'words' column contains a list of values instead of a single string?

Nested String Finding - Pandas#

In this problem, our 'words' Series contains lists of strings. Since these lists are backed by the 'object' dtype, we should expect performance on par with base Python. Additionally, I was curious to see if I could do a few different things:

  1. Indicate whether one of the words in each row of the DataFrame matches a word in words_to_find

  2. Count the number of matches between each row in the DataFrame and the words in words_to_find

  3. Filter the words in each row of the DataFrame down to only those that match

To do each of these, I took two primary approaches:

  1. .explode the nested data, then reduce back down to the original index

  2. .apply over the object dtype, which operates on each nested list directly and doesn’t require any reshaping

from pandas import DataFrame

words_to_find = ['phenomenal', 'lungs', 'drowse', 'snack']
df = DataFrame({
    'words': [
        ['stria', 'phenomenal', 'lungs'],
        ['drowse', 'exhorting', 'ark'],
        ['bespeaking', 'aggregates', 'confute'],
        ['snack', 'pleasingly', 'trench'],
        ['loather', 'overmodest', 'wardress']
    ]
})

df.assign(
    explode_overlap=lambda d: 
        d['words'].explode().isin(words_to_find).groupby(level=0).any(),
    explode_counting=lambda d: 
        d['words'].explode().isin(words_to_find).groupby(level=0).sum(),
    explode_intersection=lambda d: 
        d['words'].explode()
        .loc[lambda d: d.isin(words_to_find)]
        .groupby(level=0).apply(list),
    apply_overlap=lambda d: 
        d['words'].apply(lambda v: any(w in v for w in words_to_find)),
    apply_counting=lambda d: 
        d['words'].apply(lambda v: sum(w in v for w in words_to_find)),
    apply_intersection=lambda d:
        d['words'].apply(lambda v: [w for w in words_to_find if w in v]),
)
                               words  explode_overlap  explode_counting explode_intersection  apply_overlap  apply_counting   apply_intersection
0         [stria, phenomenal, lungs]             True                 2  [phenomenal, lungs]           True               2  [phenomenal, lungs]
1           [drowse, exhorting, ark]             True                 1             [drowse]           True               1             [drowse]
2  [bespeaking, aggregates, confute]            False                 0                  NaN          False               0                   []
3        [snack, pleasingly, trench]             True                 1              [snack]           True               1              [snack]
4    [loather, overmodest, wardress]            False                 0                  NaN          False               0                   []
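
Note one difference in the output above: where a row has no matches, the explode-based intersection yields NaN while the .apply version yields an empty list. If the two need to agree, a minimal sketch of one possible cleanup (the name cleaned_intersection is mine) reindexes the grouped result back onto the original index and substitutes empty lists:

cleaned_intersection = (
    df['words'].explode()
    .loc[lambda s: s.isin(words_to_find)]
    .groupby(level=0).agg(list)
    .reindex(df.index)                                  # restore rows that had no matches
    .apply(lambda v: v if isinstance(v, list) else [])  # NaN → []
)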

Nested String Finding - Polars#

Polars provides access to a List datatype. This powerful feature should speed up the search we are trying to do. In this section, I will take the same approaches as above and leverage the relevant Polars expressions. The .explode approach ends up being much more verbose than the other two, but we can see the convenience that the List datatype offers: we can use .list.set_intersection to find which words in each row of the DataFrame intersect with our words_to_find. We can also apply element-wise operations via .list.eval, which should be much faster than our naive .apply approach in pandas.

from polars import from_pandas, col, element as pl_element

(
    from_pandas(df)
    .with_row_count()
    .explode('words')
    .group_by('row_nr', maintain_order=True).agg( # maintain_order keeps the output aligned with the original row order
        words=col('words'),
        explode_intersection=col('words').filter(col('words').is_in(words_to_find)),
    )
    .with_columns(
        set_intersection=col('words').list.set_intersection(words_to_find),
        element_overlap=(
            col('words').list.eval(pl_element().is_in(words_to_find)).list.any()
        ),
    )
)
shape: (5, 5)
row_nr  words                                    explode_intersection     set_intersection         element_overlap
u32     list[str]                                list[str]                list[str]                bool
0       ["stria", "phenomenal", "lungs"]         ["phenomenal", "lungs"]  ["phenomenal", "lungs"]  true
1       ["drowse", "exhorting", "ark"]           ["drowse"]               ["drowse"]               true
2       ["bespeaking", "aggregates", "confute"]  []                       []                       false
3       ["snack", "pleasingly", "trench"]        ["snack"]                ["snack"]                true
4       ["loather", "overmodest", "wardress"]    []                       []                       false

Performance Testing#

Now that we’ve discussed some approaches to this problem, let’s test it out on a larger dataset and then compare the runtime of each.

In the below setup, I create a list of 1,000 words that I am searching for. Then, I create 10,000 lists with 50 strings each and store those data in both Polars and pandas DataFrames. For the performance test, I will only evaluate this Boolean condition: “is a word from words_to_find in the list of words for this row of the DataFrame?”

from numpy.random import default_rng

with open('./data/words') as f:
    word_pool = [
        word for line in f
        if (word := line.strip()).islower()
        and len(word) > 3
        and word.isascii()
        and word.isalpha()
    ]

rng = default_rng(0)
words_to_find = list(rng.choice(word_pool, size=1_000))

pd_df = DataFrame(
    {'words': rng.choice(word_pool, size=(10_000, 50)).tolist()}
)
pl_df = from_pandas(pd_df).lazy()

display(
    words_to_find[:5],
    pd_df.head(),
    pl_df.head(5).collect(),
)
['stressful', 'philatelist', 'lungfuls', 'drollest', 'exfoliates']
words
0 [mooting, bisexually, conveyable, tadpole, ind...
1 [devouring, sandhogs, fickleness, pandemonium,...
2 [whispers, infuriates, moratorium, hermaphrodi...
3 [sinkholes, electrics, incarnate, unappropriat...
4 [pasturage, acct, genuflecting, regurgitates, ...
shape: (5, 1)
words
list[str]
["mooting", "bisexually", … "implementations"]
["devouring", "sandhogs", … "toccata"]
["whispers", "infuriates", … "defects"]
["sinkholes", "electrics", … "downloading"]
["pasturage", "acct", … "rocker"]

Let’s set up a timing context manager for some quick tests:

from time import perf_counter, sleep
from contextlib import contextmanager

@contextmanager
def timer():
    start = perf_counter()
    yield (t := [None])  # yield a mutable 1-element list so the caller can read the duration after the block exits
    stop = perf_counter()
    t[0] = stop - start  # overwrite the placeholder with the elapsed time

with timer() as t:
    sleep(1)

print(f'the block of code above took {t[0]:.6f}s')
the block of code above took 1.000987s

Here are all of the tests we are running:

  • pandas using the explode → reduce

  • pandas using .apply

  • Polars using .list.set_intersection

  • Polars using explode → reduce

  • Polars using list.eval for elementwise operations

performance_tests = {
    ('pandas', '.explode…groupby'): lambda: 
        pd_df['words']
        .explode()
        .isin(words_to_find)
        .groupby(level=0).any(),
    
    ('pandas', '.apply'): lambda: 
        pd_df['words']
        .apply(lambda v: any(w in v for w in words_to_find)),
    
    ('polars', '.explode…groupby'): lambda: 
        pl_df.with_row_count()
        .explode('words')
        .group_by('row_nr', maintain_order=True)  # keep results aligned with input order for the comparison below
        .agg(col('words').is_in(words_to_find).any())
        .select(col('words'))
        .collect(),
    
    ('polars', '.set_intersection'): lambda:
        pl_df.select(
            col('words').list.set_intersection(list(words_to_find))
            .list.lengths() >= 1
        ).collect(),
    
    ('polars', '.element'): lambda:
        pl_df.select(
            col('words').list.eval(pl_element().is_in(words_to_find)).list.any()
        ).collect(),
}

With our tests set up, let’s run each of them to see how long each approach takes. We’ll also be careful to store the result of the operation to ensure each test generates the same set of results.

timing_results, numeric_results = [], []
for (backend, desc), test in performance_tests.items():
    with timer() as t:
        res = test()
    if not isinstance(res, Series): # coerce all results to comparable structures
        res = res.to_series().to_pandas()

    numeric_results.append(res)
    timing_results.append(
        {'backend': backend, 'description': desc, 'time delta': t[0]}
    )

# ensure all tests generated the same set of results
from itertools import pairwise
assert all((a == b).all() for a, b in pairwise(numeric_results))

Looks like everything ran to completion, so let’s see how long each approach took.

(
    pd_DataFrame(timing_results)
    .set_index(['backend', 'description'])
    .sort_values(['time delta'])
).style.background_gradient(cmap='Greens', subset='time delta')
                            time delta
backend description
polars  .element              0.015200
        .set_intersection     0.019428
        .explode…groupby      0.033555
pandas  .explode…groupby      0.068308
        .apply                4.813214

Clearly, Polars has the advantage here. For tests of this size (1,000 targets searched across 10,000 lists of 50 strings each), Polars comes out ahead regardless of the approach taken. pandas lags behind on this operation, but that alone isn’t an interesting takeaway. What’s interesting is that:

  1. All of the Polars approaches are relatively similar in their speed

  2. pandas has a LARGE difference between its fastest and slowest approaches

This supports my observation that naive code in Polars will still run fast, whereas naive code in pandas can have a detrimental impact on performance.

That’s all I have for this week. Until next time!