What the Index?

Hello, world! My schedule is jam-packed this week getting ready for my upcoming seminar, “Spot the Lies Told by this Data,” but even that can’t take me away from Cameron’s Corner! This week, I want to discuss my old friend, the Index.

I’ve taught pandas to numerous colleagues and clients, and the most important lesson to learn when working with this tool is to always respect the Index.

Why have an Index?

You always need to keep the Index in mind, whether you’re running a quick analysis or creating a complex pipeline of pandas.DataFrames.

The Index underpins nearly every operation in your pandas code. Before a binary operation such as addition is carried out, pandas first aligns the .index of one object against the .index of the other. This is conceptually similar to performing an outer join between two or more entities prior to working on them.

By automating this step as part of our work, we are less prone to alignment errors. On the other hand, index-alignment is a source of mysterious NaNs if you do not know it exists:

from pandas import Series

s1 = Series([1,  2,  3 ], index=[*'abc'], name='s1')
s2 = Series([10, 11, 12], index=[*'bcd'], name='s2')

s1 + s2
a     NaN
b    12.0
c    14.0
d     NaN
dtype: float64

You can see that this simple addition produced NaN values for two labels but carried out the addition for the other two! The reason the elements were not simply added position by position is index alignment.

Let’s break this process down:

from pandas import concat

concat(s1.align(s2), axis='columns')
    s1    s2
a  1.0   NaN
b  2.0  10.0
c  3.0  11.0
d  NaN  12.0

Above, we can see how these two Series align against each other. When performing any operation between them, we must first step through index alignment. The reason we observed NaN values from our addition is simply that s2 did not have a value at index label 'a'. Likewise, s1 did not have a value at index label 'd'.
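The outer join is only the default. As a minimal sketch (it rebuilds s1 and s2 with the same values as above, and the names left and right are just illustrative), Series.align accepts a join argument, so you can inspect what an inner alignment would keep instead:

```python
from pandas import Series

s1 = Series([1, 2, 3], index=[*'abc'], name='s1')
s2 = Series([10, 11, 12], index=[*'bcd'], name='s2')

# join='inner' keeps only the labels present in both Series
left, right = s1.align(s2, join='inner')

print(left.index.tolist())      # ['b', 'c']
print((left + right).tolist())  # [12, 14]
```

With no labels missing on either side, no NaN is introduced and the result stays an integer Series.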

By labeling our data and keeping track of the Index, we are less likely to add (or apply any other operation to) entities that do not belong together and end up with spurious results. It is much easier to answer “Why are the results of this operation NaN?” than it is to debug seemingly correct results.
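If you decide that a missing label really should count as zero, you can say so explicitly rather than discovering it through NaNs. A minimal sketch, assuming that treating a missing value as 0 is appropriate for your data (it often is not); the name total is illustrative:

```python
from pandas import Series

s1 = Series([1, 2, 3], index=[*'abc'], name='s1')
s2 = Series([10, 11, 12], index=[*'bcd'], name='s2')

# Series.add aligns on the Index just like `+`, but fill_value stands in
# for a label that is missing from one side before the addition happens
total = s1.add(s2, fill_value=0)

print(total.to_dict())  # {'a': 1.0, 'b': 12.0, 'c': 14.0, 'd': 12.0}
```

The alignment step is unchanged here; only the handling of the gaps it exposes is different.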

Slicing with an Index

Aside from alignment, we can also perform filtering on the Index. Since pandas holds its data in memory, a lookup that goes through the Index can avoid scanning every row, so we can gain large performance benefits from designing our queries around the Index.

print(
    f"{s2.loc['b'] = }",  # label-based access
    f'{s2.iloc[0]  = }',  # position-based access
    sep='\n'
)
s2.loc['b'] = 10
s2.iloc[0]  = 10

Using .loc, we can select values out of our Series according to their associated Index labels. This is akin to selecting a value out of a dictionary. By comparison, we can use .iloc to select a value based on its position in the Series, ignoring the Index.
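The difference between the two is easiest to see when the labels are themselves integers. A small sketch with a hypothetical Series, s3, whose labels run in the opposite order to its positions:

```python
from pandas import Series

# hypothetical Series whose integer labels run opposite to its positions
s3 = Series([10, 11, 12], index=[2, 1, 0], name='s3')

by_label = s3.loc[0]      # the value whose Index label is 0
by_position = s3.iloc[0]  # the first value, whatever its label is

print(int(by_label), int(by_position))  # 12 10
```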

from pandas import DataFrame, MultiIndex

index = MultiIndex.from_product(
    [['a', 'b', 'c'], [*range(20000)]], names=['level_0', 'level_1']
)

df = DataFrame(
    {'value': [*range(len(index))]},
    index=index
)

df.head()
                 value
level_0 level_1
a       0            0
        1            1
        2            2
        3            3
        4            4

Slicing On the Index

%timeit -n 3 df.loc['b']
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
371 µs ± 230 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)

Slicing With a Boolean Mask

no_index_df = df.reset_index()
%timeit -n 3 no_index_df.loc[no_index_df['level_0'] == 'a']
3.75 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
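The %timeit magic is only available inside IPython. As a rough sketch of the same comparison using the standard library's timeit module, the snippet below selects the same label, 'b', with both approaches and checks that they return the same values. The names by_index, by_mask, index_time, and mask_time are illustrative, and exact timings depend on your machine and pandas version, so none are shown:

```python
from timeit import timeit

from pandas import DataFrame, MultiIndex

index = MultiIndex.from_product(
    [['a', 'b', 'c'], [*range(20000)]], names=['level_0', 'level_1']
)
df = DataFrame({'value': [*range(len(index))]}, index=index)
no_index_df = df.reset_index()

# select the same group both ways
by_index = df.loc['b']
by_mask = no_index_df.loc[no_index_df['level_0'] == 'b']

# both approaches pick out the same values
same_values = by_index['value'].tolist() == by_mask['value'].tolist()

# total seconds for 20 repetitions of each approach
index_time = timeit(lambda: df.loc['b'], number=20)
mask_time = timeit(
    lambda: no_index_df.loc[no_index_df['level_0'] == 'b'], number=20
)
print(f'{same_values = }', f'{index_time = :.4f}', f'{mask_time = :.4f}', sep='\n')
```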

We can see that the Index has large speed benefits when using it for subsetting/slicing operations. This is because the Index tracks properties of its own labels, such as whether they are sorted and whether they are unique, and uses them to pick a faster lookup strategy. This knowledge does come at a price, though, as the Index can become a little heavy, or much less effective, if it does not meet certain assumptions:

print(
    f'{s1.index.is_monotonic_increasing = }',
    f'{s1.index.is_unique               = }',
    sep='\n'
)
s1.index.is_monotonic_increasing = True
s1.index.is_unique               = True
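To see what happens when those assumptions fail, here is a small sketch with a hypothetical Series, messy, whose Index is neither unique nor sorted. A label slice has no well-defined endpoint on such an Index, so pandas raises a KeyError; after sorting, the same slice works:

```python
from pandas import Series

# hypothetical Series with a duplicated, unsorted Index
messy = Series([1, 2, 3], index=['b', 'a', 'b'])
print(messy.index.is_unique, messy.index.is_monotonic_increasing)  # False False

# slicing by label is ambiguous here, so pandas refuses
try:
    messy.loc['a':'b']
    slice_failed = False
except KeyError:
    slice_failed = True

# sorting makes the Index monotonic, and the same slice becomes valid
tidy = messy.sort_index()
subset = tidy.loc['a':'b']
print(slice_failed, len(subset))  # True 3
```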

This is some of the information that pandas uses to optimize Index-based operations. My favorite part is that the Index isn’t just limited to hardware-backed datatypes. pandas has created abstractions to meaningfully index many different types of tabular data:

Pandas Index Types

  • RangeIndex – an Index implementing a monotonic integer range

  • CategoricalIndex – an Index of Categoricals

  • MultiIndex – A multi-level, or hierarchical Index

  • IntervalIndex – An Index of Intervals

  • DatetimeIndex – Index of datetime64 data

  • TimedeltaIndex – Index of timedelta64 data

  • PeriodIndex – Index of Period data

  • NumericIndex – Index of numpy int/uint/float data (removed in pandas 2.0, where the base Index holds these dtypes directly)

Each of these Index types probably warrants a blog post on its own, so I’ll leave that for a future post!
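As a small taste in the meantime, here is a sketch of what two of these types make possible. The Series names (daily, grades) and their values are made up purely for illustration:

```python
from pandas import IntervalIndex, Series, date_range

# DatetimeIndex: a partial date string selects every label in that month
daily = Series(range(60), index=date_range('2024-01-01', periods=60, freq='D'))
january = daily.loc['2024-01']
print(len(january))  # 31

# IntervalIndex: a scalar is looked up in the interval that contains it
bins = IntervalIndex.from_breaks([0, 10, 20])  # (0, 10] and (10, 20]
grades = Series(['low', 'high'], index=bins)
print(grades.loc[5], grades.loc[15])  # low high
```

In both cases the lookup is expressed in terms of what the labels mean (a month, a range of values) rather than where the rows happen to sit.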

Wrap Up

That’s all for this week. Thanks for tuning in to learn a little bit more about the pandas.Index and the power it gives your analysis code. And, don’t forget: the most important lesson to learn when working with this tool is to always respect the Index!