Python Set vs Pandas.Index#
For the past few weeks, I have been meeting with some fantastic clients in
one-on-one sessions to cover the core Python and pandas skills needed to perform
rapid data analysis. We have discussed a variety of topics, but this week has been one
of my favorites because we are doing a deep dive into pandas. Of course, the
framing for pandas is all about the Index, so I decided to keep it light and
ensure we tie it back to some core Python concepts.
When discussing the Index in pandas, I always find it useful to contrast it against
a Python built-in that exhibits some similar behaviors: the set. This week,
I want to focus on each of these data structures to understand where they overlap, where they differ, and the lessons they can teach us.
Interested in 1:1 sessions with me and James for your (or your team’s) core Python, pandas, polars, and data visualization skills?
Email us: info@dutc.io for more information about our training options
Sets#
The Python set is a representation of the classic mathematical set. Sets exhibit two important features:
The elements of a set are unique (a set never contains duplicate elements)
Sets are unordered (one cannot predict the ordering of the elements contained within a set)
Let’s look at an example of each:
Unique
{'a', 'a', 'b', 'a', 'c'}
{'a', 'b', 'c'}
Unordered
for elem in {'a', 'b', 'c', 'd'}:
    print(elem)
c
a
b
d
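Both properties can be checked directly. The snippet below is a minimal sketch using a hypothetical list of letters; it shows duplicates collapsing on construction, equality that ignores order, and an explicit sort for the cases where a reproducible order is needed.

```python
letters = ['a', 'a', 'b', 'a', 'c']   # hypothetical input with repeats

unique = set(letters)                 # duplicates collapse on construction
print(len(letters), len(unique))      # 5 3

# Two sets with the same elements are equal regardless of how they were written.
print({'a', 'b', 'c'} == {'c', 'b', 'a'})   # True

# Iteration order is arbitrary, so sort explicitly when order matters.
print(sorted(unique))                 # ['a', 'b', 'c']
```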
Behaviors
In addition to the above characteristics, sets also enable us to perform algebraic operations. For example, we can use Python sets to quickly answer questions about membership. These behaviors are typically visualized with a Venn diagram.
set1 = {'a', 'b', 'c', 'd'          }
set2 = {          'c', 'd', 'e', 'f'}
print(
    f'{set1 | set2 = }', # union
    f'{set1 & set2 = }', # intersection
    f'{set1 - set2 = }', # difference
    f'{set2 - set1 = }', # difference (note ordering)
    f'{set1 ^ set2 = }', # symmetric difference
    sep='\n',
)
set1 | set2 = {'a', 'd', 'e', 'b', 'f', 'c'}
set1 & set2 = {'d', 'c'}
set1 - set2 = {'a', 'b'}
set2 - set1 = {'f', 'e'}
set1 ^ set2 = {'a', 'e', 'b', 'f'}
In addition to the operators above, there are instance methods we can use to invoke the same operations. This can be useful for adding a touch of duck typing to your code, ensuring that you're working with an object that supports a set-like vocabulary.
set1 = {'a', 'b', 'c', 'd'          }
set2 = {          'c', 'd', 'e', 'f'}
print(
    f'{set1.union(set2) = }', # |
    f'{set1.intersection(set2) = }', # &
    f'{set1.difference(set2) = }', # -
    f'{set2.difference(set1) = }', # -
    f'{set1.symmetric_difference(set2) = }', # ^
    sep='\n',
)
set1.union(set2) = {'a', 'd', 'e', 'b', 'f', 'c'}
set1.intersection(set2) = {'d', 'c'}
set1.difference(set2) = {'a', 'b'}
set2.difference(set1) = {'f', 'e'}
set1.symmetric_difference(set2) = {'a', 'e', 'b', 'f'}
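One practical difference between the two spellings: the method form accepts any iterable, while the operator form requires a set on both sides. A small sketch, where `as_list` is a hypothetical name for a plain list of the same letters:

```python
set1 = {'a', 'b', 'c', 'd'}
as_list = ['c', 'd', 'e', 'f']   # a list, not a set

# The method form accepts any iterable.
via_method = set1.union(as_list)
print(sorted(via_method))

# The operator form raises a TypeError when the right-hand side is not a set.
try:
    set1 | as_list
except TypeError as exc:
    operator_error = exc
else:
    operator_error = None

print(type(operator_error).__name__)
```

This is part of why the method spelling suits duck-typed code: the caller does not have to convert the argument first.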
Index - Without Duplicates#
Unlike the Python set, a pandas.Index is inherently ordered (thus the presence of .sort_index()) and can contain duplicate values. In practice, we may often .pipe(lambda s: s.groupby(s.index)).agg(...) on data loading to eliminate duplicates and to avoid having to worry about the (sometimes confusing) alignment modalities that perform Cartesian products. The ordering of the index is an important property, and you can query it with .is_monotonic_increasing and .is_monotonic_decreasing. Given a known-sorted index, pandas can perform operations such as .loc more efficiently, using an O(log n) binary search instead of an O(n) linear scan.
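As a rough sketch of the deduplicate-on-load idea, consider a hypothetical Series whose index is neither unique nor sorted. The aggregation is assumed to be a sum here; the right choice depends on the data.

```python
from pandas import Series

# Hypothetical readings: the label 'b' appears twice and the index is unsorted.
s = Series([1, 2, 3, 4], index=['b', 'a', 'b', 'c'])

print(f'{s.index.is_unique = }')
print(f'{s.index.is_monotonic_increasing = }')

# Group on the index itself and aggregate; 'sum' is an assumption for this sketch.
deduped = s.pipe(lambda s: s.groupby(s.index)).agg('sum')

print(deduped)
print(f'{deduped.index.is_unique = }')
print(f'{deduped.index.is_monotonic_increasing = }')
```

Because groupby sorts its keys by default, the result comes back with an index that is both unique and monotonically increasing.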
An underappreciated feature of Index objects is that they also implement the set vocabulary:
from pandas import Index
idx1 = Index(['a', 'b', 'c', 'd'          ])
idx2 = Index([          'c', 'd', 'e', 'f'])
print(
    f'{idx1.union(idx2) = }',
    f'{idx1.intersection(idx2) = }',
    f'{idx1.difference(idx2) = }',
    f'{idx2.difference(idx1) = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
idx1.union(idx2) = Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
idx1.intersection(idx2) = Index(['c', 'd'], dtype='object')
idx1.difference(idx2) = Index(['a', 'b'], dtype='object')
idx2.difference(idx1) = Index(['e', 'f'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'b', 'e', 'f'], dtype='object')
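Because both types share these method names, a duck-typed function can work with either one. The helper below, `only_in_first`, is a hypothetical name used purely for illustration:

```python
from pandas import Index

def only_in_first(left, right):
    """Hypothetical helper: elements of `left` that are absent from `right`."""
    # Relies only on a .difference method, which both set and Index provide.
    return left.difference(right)

from_sets = only_in_first({'a', 'b', 'c', 'd'}, {'c', 'd', 'e', 'f'})
from_indexes = only_in_first(
    Index(['a', 'b', 'c', 'd']),
    Index(['c', 'd', 'e', 'f']),
)

print(from_sets)       # a set
print(from_indexes)    # an Index
```

The return type follows the input type, so callers get a set back from sets and an Index back from Index objects.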
Index - With Duplicates#
This behavior is slightly different if you introduce repeated values in the Index.
The union operation will retain the duplicates with the most repetitions in its output, similar to a multi-set.
All other operations discard duplicated values:
from pandas import Index
idx1 = Index(['a', 'a', 'b',                   ])
idx2 = Index([          'b', 'b', 'c', 'c', 'd'])
print(
    f'{idx1.union(idx2) = }',
    f'{idx1.intersection(idx2) = }',
    f'{idx1.difference(idx2) = }',
    f'{idx2.difference(idx1) = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
idx1.union(idx2) = Index(['a', 'a', 'b', 'b', 'c', 'c', 'd'], dtype='object')
idx1.intersection(idx2) = Index(['b'], dtype='object')
idx1.difference(idx2) = Index(['a'], dtype='object')
idx2.difference(idx1) = Index(['c', 'd'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'c', 'd'], dtype='object')
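The multi-set comparison can be made concrete with the standard library's `collections.Counter`, whose `|` operator keeps the larger count of each element. This is only an analogy for the union result shown above, not a description of how pandas computes it.

```python
from collections import Counter

bag1 = Counter(['a', 'a', 'b'])
bag2 = Counter(['b', 'b', 'c', 'c', 'd'])

# Multi-set union: each element keeps its highest count from either side.
union = bag1 | bag2

print(sorted(union.elements()))
# ['a', 'a', 'b', 'b', 'c', 'c', 'd']
```

The analogy stops at union: `bag1 - bag2` would keep both copies of 'a', whereas the Index difference above reports 'a' only once.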
While the pandas.Index is not constrained to being unique or sorted, it is very
useful to know the state of its contents, as these will trigger various fastpaths
in indexing operations. For idx1 above, we can see that pandas
is aware that its contents are not unique but are monotonically increasing.
print(
    f'{idx1.is_monotonic_increasing = }',
    f'{idx1.is_monotonic_decreasing = }',
    f'{idx1.is_unique = }',
    sep='\n',
)
idx1.is_monotonic_increasing = True
idx1.is_monotonic_decreasing = False
idx1.is_unique = False
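Uniqueness matters for more than fastpaths. The Cartesian-product alignment mentioned earlier appears as soon as two differently labeled objects with repeated labels are combined. A minimal sketch with two hypothetical Series:

```python
from pandas import Series

# Both indexes repeat the label 'a', and the two indexes are not identical.
left = Series([1, 2, 3], index=['a', 'a', 'b'])
right = Series([10, 20, 30], index=['a', 'a', 'c'])

total = left + right   # aligns on labels before adding

print(total)
print(f'{len(left) = }, {len(right) = }, {len(total) = }')
```

Each 'a' on the left is paired with each 'a' on the right, so the result has more rows than either input, while 'b' and 'c' have no partner and come back as NaN.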
Application#
With the discussion out of the way, let’s apply some of what we’ve seen. Taking a sample DataFrame with temperature/precipitation measurements for each day, let’s find out which days are missing from our DataFrame.
from pandas import DataFrame, Series, date_range
from numpy.random import default_rng
rng = default_rng(0)
all_dates = date_range('2000-01-01', '2000-12-31', freq='D', name='date')
df = (
    DataFrame(
        index=all_dates,
        data={
            'temperature': rng.normal(60, scale=5, size=len(all_dates)),
            'precip': rng.uniform(0, 3, size=len(all_dates)),
        }
    )
    .sample(frac=.99, random_state=rng) # sample 99% of our dataset
    .sort_index()
)
df.head()
date | temperature | precip |
---|---|---|
2000-01-01 | 60.628651 | 1.931114 |
2000-01-02 | 59.339476 | 2.177273 |
2000-01-03 | 63.202113 | 0.248426 |
2000-01-04 | 60.524501 | 1.058230 |
2000-01-05 | 57.321653 | 1.559499 |
What days are missing?
① Ignoring the .index
While we could use array operations, I wanted to demonstrate a common SQL-oriented approach. Let’s perform a join operation to calculate our set operations (the parallels between different join types and set operations are intentional here).
(
    df.reset_index()
    .merge(
        all_dates.to_frame(index=False), on='date', indicator=True, how='outer'
    )
    .query('_merge == "right_only"')
    ['date']
)
11 2000-01-12
44 2000-02-14
88 2000-03-29
207 2000-07-26
Name: date, dtype: datetime64[ns]
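The parallel between join types and set operations is easier to see on a tiny pair of tables. The sketch below uses two hypothetical single-column frames and reads the `_merge` indicator as a set-membership label; `keys_where` is a helper name invented for this example.

```python
from pandas import DataFrame

left = DataFrame({'key': ['a', 'b', 'c', 'd']})
right = DataFrame({'key': ['c', 'd', 'e', 'f']})

# An outer join keeps every key; the indicator records which side it came from.
merged = left.merge(right, on='key', how='outer', indicator=True)

def keys_where(side):
    """Hypothetical helper: keys whose indicator equals `side`."""
    return set(merged.loc[merged['_merge'] == side, 'key'])

print(f"{keys_where('left_only') = }")    # left - right
print(f"{keys_where('both') = }")         # left & right
print(f"{keys_where('right_only') = }")   # right - left
```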
The above approach is standard when one chooses to “ignore the index.” Note that
we are doing both a .reset_index and a .to_frame on our constituent tables.
This is a sign of “wrestling” with pandas instead of working with it.
Instead, we might choose to…
② Leveraging the .index
all_dates.difference(df.index)
DatetimeIndex(['2000-01-12', '2000-02-14', '2000-03-29', '2000-07-26'], dtype='datetime64[ns]', name='date', freq=None)
It's just that simple.
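For readers who want to try this without the random sample, here is a smaller sketch on one hypothetical week of dates with two days removed by position. It also includes the array-style boolean mask for comparison.

```python
from pandas import date_range

all_dates = date_range('2000-01-01', '2000-01-07', freq='D', name='date')

# Hypothetical observations: drop positions 2 and 5 (Jan 3 and Jan 6).
observed = all_dates.delete([2, 5])

# Set vocabulary on the index.
missing = all_dates.difference(observed)

# The array-operation equivalent, using a boolean mask.
via_mask = all_dates[~all_dates.isin(observed)]

print(missing)
print(via_mask)
```

Both spellings find the same two days; the `.difference` version states the intent more directly.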
Wrap-Up#
There we have it: key comparisons between the Index in pandas and Python’s built-in set.
What do you think about this topic? Let us know on the DUTC Discord server.
That’s all for today. Until next time!