# Python Set vs Pandas.Index

For the past few weeks, I have been meeting with some fantastic clients in one-on-one sessions to cover the core Python and pandas skills needed to perform rapid data analysis. We have discussed a variety of topics, but this week has been one of my favorites because we are doing a deep dive into pandas. Of course, the framing for pandas is all about the `Index`, so I decided to keep it light and ensure we tie it back to some core Python concepts.

When discussing the `Index` in pandas, I always find it useful to contrast it against a Python built-in that exhibits some similar behaviors: the `set`. This week, I want to focus on each of these data structures to understand where they overlap, their differences, and the lessons they can teach us.


## Sets

The Python set is a representation of a classic mathematical set. Our sets exhibit two important features:

1. The elements of a set are unique (no duplicate elements are found in a common set)

2. Sets are unordered (one cannot predict the ordering of the elements contained within a set)

Let’s look at an example of each:

Unique

```python
{'a', 'a', 'b', 'a', 'c'}
```
```
{'a', 'b', 'c'}
```

Unordered

```python
for elem in {'a', 'b', 'c', 'd'}:
    print(elem)
```
```
a
b
c
d
```
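A loop like the one above can happen to print its elements in a tidy order, so it is worth seeing the property more directly. Here is a minimal sketch using only built-ins (the variable names are illustrative): two sets built in different orders compare equal, and a set refuses positional access.

```python
# Two sets built from the same elements in different orders are equal,
# because a set has no notion of position.
forward = {'a', 'b', 'c', 'd'}
backward = {'d', 'c', 'b', 'a'}
print(forward == backward)

# Positional access is not supported; indexing a set raises TypeError.
error_message = None
try:
    forward[0]
except TypeError as exc:
    error_message = str(exc)
    print(f'TypeError: {error_message}')

# If a predictable order is needed, sort the elements explicitly.
ordered = sorted(forward)
print(ordered)
```

If you need a stable order, ask for one explicitly with `sorted` rather than relying on how a set happens to iterate.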

Behaviors

In addition to the above characteristics, sets also enable us to perform algebraic operations. For example, we can use Python sets to quickly answer questions about membership. These behaviors are typically visualized with a Venn diagram.

```python
set1 = {'a', 'b', 'c', 'd'          }
set2 = {          'c', 'd', 'e', 'f'}

print(
    f'{set1 | set2 = }', # union
    f'{set1 & set2 = }', # intersection
    f'{set1 - set2 = }', # difference
    f'{set2 - set1 = }', # difference (note ordering)
    f'{set1 ^ set2 = }', # symmetric difference
    sep='\n',
)
```
```
set1 | set2 = {'b', 'e', 'd', 'a', 'c', 'f'}
set1 & set2 = {'c', 'd'}
set1 - set2 = {'a', 'b'}
set2 - set1 = {'e', 'f'}
set1 ^ set2 = {'b', 'e', 'a', 'f'}
```

In addition to using the above operators, there are also instance methods we can use to invoke these operations. This can be useful to add a touch of duck typing to your code to ensure you’re working with an object that supports a set-like vocabulary.

```python
set1 = {'a', 'b', 'c', 'd'          }
set2 = {          'c', 'd', 'e', 'f'}

print(
    f'{set1.union(set2)                = }', # |
    f'{set1.intersection(set2)         = }', # &
    f'{set1.difference(set2)           = }', # -
    f'{set2.difference(set1)           = }', # -
    f'{set1.symmetric_difference(set2) = }', # ^
    sep='\n',
)
```
```
set1.union(set2)                = {'b', 'e', 'd', 'a', 'c', 'f'}
set1.intersection(set2)         = {'c', 'd'}
set1.difference(set2)           = {'a', 'b'}
set2.difference(set1)           = {'e', 'f'}
set1.symmetric_difference(set2) = {'b', 'e', 'a', 'f'}
```
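One practical difference between the two spellings is worth a quick sketch: the method form accepts any iterable as its argument, while the operator form insists that both operands are sets. The variable names here are illustrative only.

```python
set1 = {'a', 'b', 'c', 'd'}
other = ['c', 'd', 'e', 'f']  # a plain list, not a set

# The method form accepts any iterable.
merged = set1.union(other)
print(sorted(merged))

# The operator form requires both operands to be sets.
operator_failed = False
try:
    set1 | other
except TypeError:
    operator_failed = True
    print('set | list raises TypeError')
```

This is part of why the method vocabulary travels well: anything that exposes `.union`, `.intersection`, and friends can be dropped in without converting the other operand first.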

## Index - Without Duplicates

Unlike the Python `set`, a `pandas.Index` is inherently ordered (hence the existence of `.sort_index()`) and can contain duplicate values. In practice, we often `.pipe(lambda s: s.groupby(s.index)).agg(...)` when loading data to eliminate duplicates and to avoid the (sometimes confusing) alignment behavior that produces Cartesian products when labels repeat. Whether the index is sorted is an important property, and you can query it with `.is_monotonic_increasing` and `.is_monotonic_decreasing`. Given a known-sorted index, pandas can perform operations such as `.loc` more efficiently, using an O(log n) binary search instead of an O(n) linear scan.
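As a rough sketch of that load-time cleanup, the snippet below builds a small hypothetical `Series` (the values and labels are invented for illustration) whose index has a repeated, out-of-order label, then collapses it by grouping on the index. The choice of `'sum'` as the aggregation is an assumption; use whatever suits your data.

```python
from pandas import Series

# Hypothetical data: label 'b' appears twice and the labels are not sorted.
raw = Series([1, 2, 3, 4], index=['b', 'a', 'b', 'c'])
print(f'{raw.index.is_unique = }')
print(f'{raw.index.is_monotonic_increasing = }')

# Group on the index itself and aggregate to collapse the duplicates.
# groupby sorts the group keys by default, so the result is also sorted.
deduped = raw.pipe(lambda s: s.groupby(s.index)).agg('sum')
print(f'{deduped.index.is_unique = }')
print(f'{deduped.index.is_monotonic_increasing = }')
```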

An underappreciated feature of Index objects is that they also implement the set vocabulary:

```python
from pandas import Index

idx1 = Index(['a', 'b', 'c', 'd'          ])
idx2 = Index([          'c', 'd', 'e', 'f'])

print(
    f'{idx1.union(idx2)                = }',
    f'{idx1.intersection(idx2)         = }',
    f'{idx1.difference(idx2)           = }',
    f'{idx2.difference(idx1)           = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
```
```
idx1.union(idx2)                = Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
idx1.intersection(idx2)         = Index(['c', 'd'], dtype='object')
idx1.difference(idx2)           = Index(['a', 'b'], dtype='object')
idx2.difference(idx1)           = Index(['e', 'f'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'b', 'e', 'f'], dtype='object')
```

## Index - With Duplicates

This behavior is slightly different if you introduce repeated values in the Index. The `union` operation keeps each duplicated value as many times as it appears in whichever index repeats it most, similar to a multiset.

All other operations discard duplicated values:

```python
from pandas import Index

idx1 = Index(['a', 'a', 'b',                   ])
idx2 = Index([          'b', 'b', 'c', 'c', 'd'])

print(
    f'{idx1.union(idx2)                = }',
    f'{idx1.intersection(idx2)         = }',
    f'{idx1.difference(idx2)           = }',
    f'{idx2.difference(idx1)           = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
```
```
idx1.union(idx2)                = Index(['a', 'a', 'b', 'b', 'c', 'c', 'd'], dtype='object')
idx1.intersection(idx2)         = Index(['b'], dtype='object')
idx1.difference(idx2)           = Index(['a'], dtype='object')
idx2.difference(idx1)           = Index(['c', 'd'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'c', 'd'], dtype='object')
```
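For comparison, the standard library has its own multiset in `collections.Counter`, whose `|` operator keeps the maximum count of each element. The sketch below is only an analogy for the `union` result above, not a description of how pandas implements it.

```python
from collections import Counter

# Counter behaves like a multiset: | keeps the maximum count per element.
left = Counter(['a', 'a', 'b'])
right = Counter(['b', 'b', 'c', 'c', 'd'])

combined = left | right
print(sorted(combined.elements()))
```

The elements come out as two `'a'`, two `'b'`, two `'c'`, and one `'d'`, matching the repetitions in the `union` output.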

While the `pandas.Index` is not constrained to being unique or sorted, it is very useful to know the state of its contents, as these properties trigger various fast paths in indexing operations. For `idx1` above, we can see that pandas is aware that its contents are not unique but are monotonically increasing.

```python
print(
    f'{idx1.is_monotonic_increasing = }',
    f'{idx1.is_monotonic_decreasing = }',
    f'{idx1.is_unique               = }',
    sep='\n',
)
```
```
idx1.is_monotonic_increasing = True
idx1.is_monotonic_decreasing = False
idx1.is_unique               = False
```

## Application

With the discussion out of the way, let’s apply some of what we’ve seen. Taking a sample DataFrame with temperature/precipitation measurements for each day, let’s try to find out what days we are missing from our DataFrame.

```python
from pandas import DataFrame, date_range
from numpy.random import default_rng

rng = default_rng(0)
all_dates = date_range('2000-01-01', '2000-12-31', freq='D', name='date')

df = (
    DataFrame(
        index=all_dates,
        data={
            'temperature': rng.normal(60, scale=5, size=len(all_dates)),
            'precip': rng.uniform(0, 3, size=len(all_dates)),
        }
    )
    .sample(frac=.99, random_state=rng) # sample 99% of our dataset
    .sort_index()
)
```

What days are missing?

① Ignoring the `.index`

While we could use array operations, I wanted to demonstrate a common SQL-oriented approach. Let’s perform a join operation to calculate our set operations (the parallels between different join types and set operations are intentional here).

```python
(
    df.reset_index()
    .merge(
        all_dates.to_frame(index=False), on='date', indicator=True, how='outer'
    )
    .query('_merge == "right_only"')
    ['date']
)
```

The above approach is standard when one chooses to “ignore the index.” Note that we are doing both a `.reset_index` and a `.to_frame` on our constituent tables. This is a sign of “wrestling” with pandas instead of working with it.

② Leveraging the `.index`

```python
all_dates.difference(df.index)
```

It's just that simple.
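To see the result on something small enough to check by hand, here is a self-contained sketch with a five-day range from which two days are deliberately removed. The names `full` and `observed` are illustrative stand-ins for `all_dates` and `df.index`.

```python
from pandas import date_range

# Five consecutive days, standing in for the complete calendar.
full = date_range('2000-01-01', '2000-01-05', freq='D', name='date')

# Drop positions 1 and 3 (January 2 and January 4) to simulate gaps.
observed = full.delete([1, 3])

# The set difference recovers exactly the days that were dropped.
missing = full.difference(observed)
print(missing)
```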

## Wrap-Up

There we have it: key comparisons between the `Index` in pandas and Python’s built-in, `set`.