DataFrame Value Membership Testing#

This week, I received a great question on our Discord Server about finding strings within a list in a pandas.Series.

But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.

Ready to reshape your Python journey? Join us for two days of immersive learning, where expertise meets real-world application!

Now, back to the question:

In pandas, if we have a column of strings, we can check whether each of the elements is in a given set, thanks to the isin method. What if the column contains a list of strings instead? How could we check whether this list is a subset of a given reference list? Should we use explode()…isin() and then aggregate again, or are there better ways to do it?

In each of the following examples, I will create a DataFrame with a column of 'words'. In some examples, the 'words' column will simply contain a string in each cell of the Series. Then, we’ll move to the more complex problem of lists of strings nested within a Series.

Simple String Finding#

The simplest version of this problem removes the nested “string within a list” component. There are a few ways to see if a string matches multiple targets in pandas. We can either…

  • perform an .isin check

  • concatenate the targets into a regular expression and use .str.match

from pandas import DataFrame as pd_DataFrame, Series

words_to_find = ['phenomenal', 'drowse']
df = pd_DataFrame({'words': [
    'stria', 'phenomenal', 'lungs', 'drowse', 'exhorting'
]})

df.assign(
    multi_check=lambda d: d['words'].isin(words_to_find),
    multi_check_re=lambda d: d['words'].str.match(r'|'.join(words_to_find)),
).style.applymap( # formatting for display purposes
    lambda v: 'background-color: green' if v is True else ''
)
        words  multi_check  multi_check_re
0       stria        False           False
1  phenomenal         True            True
2       lungs        False           False
3      drowse         True            True
4   exhorting        False           False
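
One caveat on the regular expression approach: .str.match only anchors at the start of each string, so a target like 'drowse' would also match 'drowsey'. If exact matches are needed, a minimal sketch using .str.fullmatch (with re.escape to guard against any regex metacharacters in the targets; the column name multi_check_full is mine) would be:

from re import escape

df.assign(
    # fullmatch requires the entire string to match the alternation pattern
    multi_check_full=lambda d:
        d['words'].str.fullmatch('|'.join(escape(w) for w in words_to_find)),
)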

Either way, both approaches are fairly straightforward. But what if each row of our 'words' column contains a list of values instead of a single string?

Nested String Finding - Pandas#

In this problem, our 'words' Series contains lists of strings. Since these lists are backed by the 'object' dtype, we should expect performance on par with base Python. Additionally, I was curious to see if I could do a few different things:

  1. Indicate whether one of the words in each row of the DataFrame matches a word in words_to_find

  2. Count the number of matches between each row in the DataFrame and the words in words_to_find

  3. Filter the words in each row of the DataFrame down to only those that match

To do each of these, I took two primary approaches:

  1. .explode the nested data, then reduce back down to the original index

  2. .apply over the object dtype, which operates on each nested list directly and doesn’t require any reshaping

from pandas import DataFrame

words_to_find = ['phenomenal', 'lungs', 'drowse', 'snack']
df = DataFrame({
    'words': [
        ['stria', 'phenomenal', 'lungs'],
        ['drowse', 'exhorting', 'ark'],
        ['bespeaking', 'aggregates', 'confute'],
        ['snack', 'pleasingly', 'trench'],
        ['loather', 'overmodest', 'wardress']
    ]
})

df.assign(
    explode_overlap=lambda d: 
        d['words'].explode().isin(words_to_find).groupby(level=0).any(),
    explode_counting=lambda d: 
        d['words'].explode().isin(words_to_find).groupby(level=0).sum(),
    explode_intersection=lambda d: 
        d['words'].explode()
        .loc[lambda d: d.isin(words_to_find)]
        .groupby(level=0).apply(list),
    apply_overlap=lambda d: 
        d['words'].apply(lambda v: any(w in v for w in words_to_find)),
    apply_counting=lambda d: 
        d['words'].apply(lambda v: sum(w in v for w in words_to_find)),
    apply_intersection=lambda d:
        d['words'].apply(lambda v: [w for w in words_to_find if w in v]),
)
                               words  explode_overlap  explode_counting explode_intersection  apply_overlap  apply_counting   apply_intersection
0         [stria, phenomenal, lungs]             True                 2  [phenomenal, lungs]           True               2  [phenomenal, lungs]
1           [drowse, exhorting, ark]             True                 1             [drowse]           True               1             [drowse]
2  [bespeaking, aggregates, confute]            False                 0                  NaN          False               0                   []
3        [snack, pleasingly, trench]             True                 1              [snack]           True               1              [snack]
4    [loather, overmodest, wardress]            False                 0                  NaN          False               0                   []
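
Note one difference in the output above: where a row has no matches, the explode-based intersection yields NaN while the .apply version yields an empty list. If the two need to agree, a minimal sketch of one possible cleanup (the name cleaned_intersection is mine) reindexes the grouped result back onto the original index and substitutes empty lists:

cleaned_intersection = (
    df['words'].explode()
    .loc[lambda s: s.isin(words_to_find)]
    .groupby(level=0).agg(list)
    .reindex(df.index)                                  # restore rows that had no matches
    .apply(lambda v: v if isinstance(v, list) else [])  # NaN → []
)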

Nested String Finding - Polars#

Polars provides access to a List datatype. This powerful feature should speed up the search we are trying to do. In this section, I will take the same approaches as above and leverage the relevant Polars expressions. The .explode approach ends up being much more verbose than the other two, but we can see the convenience that the List datatype offers: we can use .list.set_intersection to find which words in each row of the DataFrame intersect with our words_to_find. We can also apply element-wise operations via .list.eval, which should be much faster than our naive .apply approach in pandas.

from polars import from_pandas, col, element as pl_element

(
    from_pandas(df)
    .with_row_count()
    .explode('words')
    .group_by('row_nr', maintain_order=True).agg( # maintain_order keeps the output aligned with the original row order
        words=col('words'),
        explode_intersection=col('words').filter(col('words').is_in(words_to_find)),
    )
    .with_columns(
        set_intersection=col('words').list.set_intersection(words_to_find),
        element_overlap=(
            col('words').list.eval(pl_element().is_in(words_to_find)).list.any()
        ),
    )
)
shape: (5, 5)
row_nr  words                                    explode_intersection     set_intersection         element_overlap
u32     list[str]                                list[str]                list[str]                bool
0       ["stria", "phenomenal", "lungs"]         ["phenomenal", "lungs"]  ["phenomenal", "lungs"]  true
1       ["drowse", "exhorting", "ark"]           ["drowse"]               ["drowse"]               true
2       ["bespeaking", "aggregates", "confute"]  []                       []                       false
3       ["snack", "pleasingly", "trench"]        ["snack"]                ["snack"]                true
4       ["loather", "overmodest", "wardress"]    []                       []                       false

Performance Testing#

Now that we’ve discussed some approaches to this problem, let’s test it out on a larger dataset and then compare the runtime of each.

In the below setup, I create a list of 1,000 words that I am searching for. Then, I create 10,000 lists with 50 strings each and store those data in both Polars and pandas DataFrames. For the performance test, I will only evaluate this Boolean condition: “is a word from words_to_find in the list of words for this row of the DataFrame?”

from numpy.random import default_rng

with open('./data/words') as f:
    word_pool = [
        word for line in f
        if (word := line.strip()).islower()
        and len(word) > 3
        and word.isascii()
        and word.isalpha()
    ]

rng = default_rng(0)
words_to_find = list(rng.choice(word_pool, size=1_000))

pd_df = DataFrame(
    {'words': rng.choice(word_pool, size=(10_000, 50)).tolist()}
)
pl_df = from_pandas(pd_df).lazy()

display(
    words_to_find[:5],
    pd_df.head(),
    pl_df.head(5).collect(),
)
['stressful', 'philatelist', 'lungfuls', 'drollest', 'exfoliates']
words
0 [mooting, bisexually, conveyable, tadpole, ind...
1 [devouring, sandhogs, fickleness, pandemonium,...
2 [whispers, infuriates, moratorium, hermaphrodi...
3 [sinkholes, electrics, incarnate, unappropriat...
4 [pasturage, acct, genuflecting, regurgitates, ...
shape: (5, 1)
words
list[str]
["mooting", "bisexually", … "implementations"]
["devouring", "sandhogs", … "toccata"]
["whispers", "infuriates", … "defects"]
["sinkholes", "electrics", … "downloading"]
["pasturage", "acct", … "rocker"]

Let’s set up a timing context manager for some quick tests:

from time import perf_counter, sleep
from contextlib import contextmanager

@contextmanager
def timer():
    start = perf_counter()
    yield (t := [None])  # yield a mutable 1-element list so the caller can read the duration after the block exits
    stop = perf_counter()
    t[0] = stop - start  # overwrite the placeholder with the elapsed time

with timer() as t:
    sleep(1)

print(f'the block of code above took {t[0]:.6f}s')
the block of code above took 1.000987s

Here are all of the tests we are running:

  • pandas using the explode → reduce

  • pandas using .apply

  • Polars using .list.set_intersection

  • Polars using explode → reduce

  • Polars using list.eval for elementwise operations

performance_tests = {
    ('pandas', '.explode…groupby'): lambda: 
        pd_df['words']
        .explode()
        .isin(words_to_find)
        .groupby(level=0).any(),
    
    ('pandas', '.apply'): lambda: 
        pd_df['words']
        .apply(lambda v: any(w in v for w in words_to_find)),
    
    ('polars', '.explode…groupby'): lambda: 
        pl_df.with_row_count()
        .explode('words')
        .group_by('row_nr', maintain_order=True)  # keep results aligned with input order for the comparison below
        .agg(col('words').is_in(words_to_find).any())
        .select(col('words'))
        .collect(),
    
    ('polars', '.set_intersection'): lambda:
        pl_df.select(
            col('words').list.set_intersection(list(words_to_find))
            .list.lengths() >= 1
        ).collect(),
    
    ('polars', '.element'): lambda:
        pl_df.select(
            col('words').list.eval(pl_element().is_in(words_to_find)).list.any()
        ).collect(),
}

With our tests set up, let’s run each of them to see how long each approach takes. We’ll also be careful to store the result of the operation to ensure each test generates the same set of results.

timing_results, numeric_results = [], []
for (backend, desc), test in performance_tests.items():
    with timer() as t:
        res = test()
    if not isinstance(res, Series): # coerce all results to comparable structures
        res = res.to_series().to_pandas()

    numeric_results.append(res)
    timing_results.append(
        {'backend': backend, 'description': desc, 'time delta': t[0]}
    )

# ensure all tests generated the same set of results
from itertools import pairwise
assert all((a == b).all() for a, b in pairwise(numeric_results))

Looks like everything ran to completion, so let’s see how long each approach took.

(
    pd_DataFrame(timing_results)
    .set_index(['backend', 'description'])
    .sort_values(['time delta'])
).style.background_gradient(cmap='Greens', subset='time delta')
                            time delta
backend description
polars  .element              0.015200
        .set_intersection     0.019428
        .explode…groupby      0.033555
pandas  .explode…groupby      0.068308
        .apply                4.813214

Clearly, Polars has the advantage here. For tests of this size (1,000 targets searched across 10,000 lists of 50 strings each), Polars comes out ahead regardless of the approach taken. pandas lags behind on this operation, but that alone isn’t an interesting takeaway. What’s interesting is that:

  1. All of the Polars approaches are relatively similar in their speed

  2. pandas has a LARGE difference between its fastest and slowest approaches

This supports my observation that naive code in Polars will still run fast, whereas naive code in pandas can have a detrimental impact on performance.

That’s all I have for this week. Until next time!