Python Set vs Pandas.Index

For the past few weeks, I have been meeting with some fantastic clients in one-on-one sessions to cover the core Python and pandas skills needed to perform rapid data analysis. We have discussed a variety of topics, but this week has been one of my favorites because we are doing a deep dive into pandas. Of course, the framing for pandas is all about the Index, so I decided to keep it light and ensure we tie it back to some core Python concepts.

When discussing the Index in pandas, I always find it useful to contrast it against a Python built-in that exhibits some similar behaviors: the set. This week, I want to focus on each of these data structures to understand where they overlap, their differences, and the lessons they can teach us.

Interested in 1:1 sessions with me and James for your (or your team’s) core Python, pandas, polars, and data visualization skills?
Email us: info@dutc.io for more information about our training options

Sets

The Python set is a representation of a classic mathematical set. Our sets exhibit two important features:

  1. The elements of a set are unique (no duplicate elements are found in a common set)

  2. Sets are unordered (one cannot predict the ordering of the elements contained within a set)

Let’s look at an example of each:

Unique

{'a', 'a', 'b', 'a', 'c'}
{'a', 'b', 'c'}

Unordered

for elem in {'a', 'b', 'c', 'd'}:
    print(elem)
a
b
c
d
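A single loop can happen to print elements in a tidy order, so here is a minimal sketch of what uniqueness and the lack of ordering mean in practice. The variable names are only for illustration.

```python
# Duplicates collapse on construction, so only distinct elements are counted.
letters = {'a', 'a', 'b', 'a', 'c'}
n_unique = len(letters)

# Equality compares membership only; insertion order and repeats are ignored.
same_members = {'a', 'b', 'c'} == {'c', 'b', 'a', 'a'}

# With no defined order there are no positions, so indexing is not supported.
try:
    letters[0]
except TypeError as exc:
    indexing_error = str(exc)

print(n_unique, same_members, indexing_error, sep='\n')
```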

Behaviors

In addition to the above characteristics, sets also support algebraic operations. For example, we can use Python sets to quickly answer questions about membership across two collections. These behaviors are typically visualized with a Venn diagram.

set1 = {'a', 'b', 'c', 'd'          }
set2 = {          'c', 'd', 'e', 'f'}

print(
    f'{set1 | set2 = }', # union
    f'{set1 & set2 = }', # intersection
    f'{set1 - set2 = }', # difference
    f'{set2 - set1 = }', # difference (note ordering)
    f'{set1 ^ set2 = }', # symmetric difference
    sep='\n',
)
set1 | set2 = {'b', 'e', 'd', 'a', 'c', 'f'}
set1 & set2 = {'c', 'd'}
set1 - set2 = {'a', 'b'}
set2 - set1 = {'e', 'f'}
set1 ^ set2 = {'b', 'e', 'a', 'f'}

In addition to using the above operators, there are also instance methods we can use to invoke these operations. This can be useful to add a touch of ducktyping to your code to ensure you’re working with an object that supports a set-like vocabulary.

set1 = {'a', 'b', 'c', 'd'          }
set2 = {          'c', 'd', 'e', 'f'}

print(
    f'{set1.union(set2)               = }', # |
    f'{set1.intersection(set2)        = }', # &
    f'{set1.difference(set2)           = }', # -
    f'{set2.difference(set1)           = }', # -
    f'{set1.symmetric_difference(set2) = }', # ^
    sep='\n',
)
set1.union(set2)               = {'b', 'e', 'd', 'a', 'c', 'f'}
set1.intersection(set2)        = {'c', 'd'}
set1.difference(set2)           = {'a', 'b'}
set2.difference(set1)           = {'e', 'f'}
set1.symmetric_difference(set2) = {'b', 'e', 'a', 'f'}
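As a rough sketch of that duck typing, the hypothetical helper below (`only_in_left` is my own name, not a library function) relies only on the `.difference` method, so it accepts a set, a frozenset, or a pandas.Index. It also shows one practical difference: the methods on built-in sets accept any iterable, while the operators require set operands on both sides.

```python
from pandas import Index

def only_in_left(left, right):
    """Hypothetical helper: elements of `left` absent from `right`.

    Relies only on `left` exposing a `.difference` method.
    """
    return left.difference(right)

from_sets = only_in_left({'a', 'b', 'c'}, {'c', 'd'})
from_frozen = only_in_left(frozenset('abc'), ['c', 'd'])  # a list works for the method
from_index = only_in_left(Index(['a', 'b', 'c']), Index(['c', 'd']))

# The operator form is stricter: a set and a list cannot be combined with `-`.
try:
    {'a', 'b', 'c'} - ['c', 'd']
except TypeError:
    operator_accepts_list = False
else:
    operator_accepts_list = True

print(from_sets, from_frozen, from_index, operator_accepts_list, sep='\n')
```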

Index - Without Duplicates

Unlike the Python set, a pandas.Index is inherently ordered (hence the existence of .sort_index()) and can contain duplicate values. In practice, we often .pipe(lambda s: s.groupby(s.index)).agg(...) right after loading data to eliminate duplicates, which avoids the (sometimes confusing) alignment modes that produce Cartesian products. The ordering of the index is an important property, and you can query it with .is_monotonic_increasing and .is_monotonic_decreasing. Given a known-sorted index, pandas can perform operations such as .loc more efficiently, using an O(log n) binary search instead of an O(n) linear scan.
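To make the deduplication idiom concrete, here is a small sketch on made-up data. The choice of 'sum' as the aggregation is an assumption for illustration; the right reduction depends on what the duplicates mean in your dataset.

```python
from pandas import Series

# Made-up readings where the label 'b' appears twice and labels are unsorted.
s = Series([1, 2, 3, 4], index=['b', 'a', 'b', 'c'])

# Group on the index itself and reduce each group to a single row.
deduped = s.pipe(lambda s: s.groupby(s.index)).agg('sum')

print(
    f'{s.index.is_unique       = }',
    f'{deduped.index.is_unique = }',
    f'{deduped.to_dict()       = }',
    sep='\n',
)
```

Because groupby sorts its keys by default, the result also comes back with a monotonically increasing index.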

An underappreciated feature of Index objects is that they also implement the set vocabulary:

from pandas import Index

idx1 = Index(['a', 'b', 'c', 'd'          ])
idx2 = Index([          'c', 'd', 'e', 'f'])

print(
    f'{idx1.union(idx2)               = }',
    f'{idx1.intersection(idx2)        = }',
    f'{idx1.difference(idx2)           = }',
    f'{idx2.difference(idx1)           = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
idx1.union(idx2)               = Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
idx1.intersection(idx2)        = Index(['c', 'd'], dtype='object')
idx1.difference(idx2)           = Index(['a', 'b'], dtype='object')
idx2.difference(idx1)           = Index(['e', 'f'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'b', 'e', 'f'], dtype='object')

Index - With Duplicates

This behavior is slightly different if you introduce repeated values in the Index. For each repeated label, the union operation keeps as many copies as the input that repeats it most, similar to a multiset.

All other operations discard duplicated values:

from pandas import Index

idx1 = Index(['a', 'a', 'b',                   ])
idx2 = Index([          'b', 'b', 'c', 'c', 'd'])

print(
    f'{idx1.union(idx2)               = }',
    f'{idx1.intersection(idx2)        = }',
    f'{idx1.difference(idx2)           = }',
    f'{idx2.difference(idx1)           = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
idx1.union(idx2)               = Index(['a', 'a', 'b', 'b', 'c', 'c', 'd'], dtype='object')
idx1.intersection(idx2)        = Index(['b'], dtype='object')
idx1.difference(idx2)           = Index(['a'], dtype='object')
idx2.difference(idx1)           = Index(['c', 'd'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'c', 'd'], dtype='object')
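The multiset comparison can be made concrete with collections.Counter from the standard library, whose `|` operator keeps the maximum count of each element. This is only an analogy for the union output shown above, not a description of how pandas implements it.

```python
from collections import Counter

left = Counter(['a', 'a', 'b'])
right = Counter(['b', 'b', 'c', 'c', 'd'])

# Counter union keeps, for each element, the larger of the two counts.
multiset_union = left | right
union_elements = sorted(multiset_union.elements())
print(union_elements)
```

If plain set semantics are wanted instead, calling .unique() on each Index before combining them is one option.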

While the pandas.Index is not constrained to being unique or sorted, it is very useful to know the state of its contents as these will trigger various fastpaths in indexing operations. For the above index objects, we can see that pandas is aware that its contents are not unique but are monotonically increasing.

print(
    f'{idx1.is_monotonic_increasing = }',
    f'{idx1.is_monotonic_decreasing = }',
    f'{idx1.is_unique               = }',
    sep='\n',
)
idx1.is_monotonic_increasing = True
idx1.is_monotonic_decreasing = False
idx1.is_unique               = False
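One detail worth a quick sketch: .is_monotonic_increasing is a non-strict check, so repeated labels do not break it, while .is_unique is tracked separately. The example labels below are made up.

```python
from pandas import Index

messy = Index(['b', 'a', 'a', 'c'])

# Sorting alone fixes the ordering but keeps the repeated 'a'.
sorted_with_repeats = messy.sort_values()

# Dropping repeated labels first, then sorting, fixes both properties.
tidy = messy.unique().sort_values()

for name, idx in [
    ('messy', messy),
    ('sorted_with_repeats', sorted_with_repeats),
    ('tidy', tidy),
]:
    print(
        f'{name}: {list(idx)}',
        f'  is_monotonic_increasing = {idx.is_monotonic_increasing}',
        f'  is_unique               = {idx.is_unique}',
        sep='\n',
    )
```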

Application

With the discussion out of the way, let’s apply some of what we’ve seen. Taking a sample DataFrame with temperature/precipitation measurements for each day, let’s try to find out what days we are missing from our DataFrame.

from pandas import DataFrame, date_range
from numpy.random import default_rng

rng = default_rng(0)
all_dates = date_range('2000-01-01', '2000-12-31', freq='D', name='date')

df = (
    DataFrame(
        index=all_dates,
        data={
            'temperature': rng.normal(60, scale=5, size=len(all_dates)),
            'precip': rng.uniform(0, 3, size=len(all_dates)),
        }
    )
    .sample(frac=.99, random_state=rng) # sample 99% of our dataset
    .sort_index()
)

df.head()

What days are missing?

① Ignoring the .index

While we could use array operations, I wanted to demonstrate a common SQL-oriented approach. Let’s perform a join operation to calculate our set operations (the parallels between different join types and set operations are intentional here).

(
    df.reset_index()
    .merge(
        all_dates.to_frame(index=False), on='date', indicator=True, how='outer'
    )
    .query('_merge == "right_only"')
    ['date']
)

The above approach is standard when one chooses to “ignore the index.” Note that we are doing both a .reset_index and a .to_frame on our constituent tables. This is a sign of “wrestling” with pandas instead of working with it.

Instead, we might choose to…

② Leveraging the .index

all_dates.difference(df.index)

It’s just that simple.
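Here is the same one-liner on a dataset small enough to check by eye, as a sketch with made-up temperatures. Index.delete is used only to knock out two known days.

```python
from pandas import DataFrame, date_range

all_dates = date_range('2000-01-01', periods=5, freq='D', name='date')

# Remove positions 1 and 3 (January 2nd and 4th) to simulate gaps.
observed = all_dates.delete([1, 3])
df = DataFrame({'temperature': [60.0, 61.5, 59.2]}, index=observed)

# Days we expected but never recorded.
missing = all_dates.difference(df.index)
print(missing)

# Reindexing against the full calendar surfaces the same gaps as NaN rows.
n_gaps = df.reindex(all_dates)['temperature'].isna().sum()
print(n_gaps)
```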

Wrap-Up

There we have it: key comparisons between the Index in pandas and Python’s built-in set.

What do you think about this topic? Let us know on the DUTC Discord server.

That’s all for today. Until next time!