What the Index?
Hello, world! My schedule is jam-packed this week getting ready for my upcoming seminar, “Spot the Lies Told by this Data,” but even that can’t take me away from Cameron’s Corner! This week, I want to discuss my old friend, the Index.
I’ve taught pandas to numerous colleagues and clients, and the most important
lesson to learn when working with this tool is to always respect the Index.
Why have an Index?
You always need to keep the Index in mind, whether you’re running a quick
analysis or creating a complex pipeline of pandas.DataFrames.
The Index is the basis of every single operation in your pandas code. Whenever
two pandas objects are combined, the first step aligns the .index of one object
against the .index of another. This is conceptually similar to performing an
outer join between two or more entities prior to working on them.
Because pandas automates this step as part of our work, we are less prone to
alignment errors. On the other hand, index alignment is a source of mysterious
NaNs if you do not know it exists:
from pandas import Series
s1 = Series([1, 2, 3 ], index=[*'abc'], name='s1')
s2 = Series([10, 11, 12], index=[*'bcd'], name='s2')
s1 + s2
a NaN
b 12.0
c 14.0
d NaN
dtype: float64
You can see that this simple addition operation resulted in a few NaN values
but also carried out the addition on two of our elements! The reason the
addition seemingly skipped some elements is index alignment.
Let’s break this process down:
from pandas import concat
concat(s1.align(s2), axis='columns')
| | s1 | s2 |
|---|---|---|
| a | 1.0 | NaN |
| b | 2.0 | 10.0 |
| c | 3.0 | 11.0 |
| d | NaN | 12.0 |
Above, we can see how these two Series align against each other. When performing
any operation, we first must step through index alignment. The reason we observed
NaN values from our addition is simply because s2 did not have a value at index
label 'a'. Likewise, Series s1 did not have a value at index label 'd'.
By labeling our data and keeping track of the Index, we are less likely to
add (or apply any other operation to) mismatched entities and produce spurious results. It is much easier to address “Why are the results of this operation NaN?” than it is to debug seemingly correct results.
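If the NaNs produced by alignment are not what you want, the alignment itself can be steered explicitly. Here is a minimal sketch that reuses s1 and s2 from above; Series.add with fill_value and Series.align with join='inner' are standard pandas calls, while the names filled, left, right, and overlap are only illustrative.

```python
from pandas import Series

s1 = Series([1, 2, 3], index=[*'abc'], name='s1')
s2 = Series([10, 11, 12], index=[*'bcd'], name='s2')

# Option 1: treat a label that is missing from one side as 0 instead of NaN
filled = s1.add(s2, fill_value=0)

# Option 2: keep only the labels present in both Series (an inner join)
left, right = s1.align(s2, join='inner')
overlap = left + right

print(filled, overlap, sep='\n\n')
```

Which option is appropriate depends on what a missing label means in your data; the default outer alignment is the one that makes a mismatch visible.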
Slicing with an Index
Aside from alignment, we can also perform filtering on the Index. Since pandas
is an in-memory representation of data, we can gain large performance benefits
from designing our queries around the Index.
print(
f"{s2.loc['b'] = }", # label-based accession
f'{s2.iloc[0] = }', # position-based accession
sep='\n'
)
s2.loc['b'] = 10
s2.iloc[0] = 10
Using .loc, we can select values out of our Series according to their associated
Index labels. This is akin to selecting a value out of a dictionary. Comparatively,
we can use .iloc to select a value based on its position in the Series, ignoring
the Index.
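The difference between the two is easiest to see when the labels are integers that do not match the positions. The sketch below uses a made-up Series named s (not part of the examples above) whose labels run in the opposite order to its positions.

```python
from pandas import Series

# hypothetical data: integer labels that deliberately disagree with position
s = Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]       # the value stored under label 0
by_position = s.iloc[0]   # the value in the first row, whatever its label
as_dict = s.to_dict()[0]  # a dictionary lookup behaves like .loc

print(f'{by_label = }', f'{by_position = }', f'{as_dict = }', sep='\n')
```

Here .loc[0] and the dictionary lookup both find the value labeled 0, while .iloc[0] returns the first row.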
from pandas import DataFrame, MultiIndex
index = MultiIndex.from_product(
[['a', 'b', 'c'], [*range(20000)]], names=['level_0', 'level_1']
)
df = DataFrame(
{'value': [*range(len(index))]},
index=index
)
df.head()
| level_0 | level_1 | value |
|---|---|---|
| a | 0 | 0 |
| | 1 | 1 |
| | 2 | 2 |
| | 3 | 3 |
| | 4 | 4 |
Slicing On the Index
%timeit -n 3 df.loc['b']
309 µs ± 143 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
Slicing With a Boolean Mask
no_index_df = df.reset_index()
%timeit -n 3 no_index_df.loc[no_index_df['level_0'] == 'a']
4.02 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
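%timeit is an IPython magic, so here is a rough sketch of the same comparison for a plain Python script, using the standard-library timeit module. It rebuilds the DataFrame from above, selects the same label ('b') both ways, and checks that the two approaches return the same values. The names by_index, by_mask, index_time, and mask_time are my own, and the timings will differ from machine to machine.

```python
from timeit import timeit

from pandas import DataFrame, MultiIndex

index = MultiIndex.from_product(
    [['a', 'b', 'c'], range(20_000)], names=['level_0', 'level_1']
)
df = DataFrame({'value': range(len(index))}, index=index)
no_index_df = df.reset_index()

# the same selection expressed two ways
by_index = df.loc['b']
by_mask = no_index_df.loc[no_index_df['level_0'] == 'b']
same_values = by_index['value'].tolist() == by_mask['value'].tolist()

# total seconds for 20 repetitions of each approach
index_time = timeit(lambda: df.loc['b'], number=20)
mask_time = timeit(
    lambda: no_index_df.loc[no_index_df['level_0'] == 'b'], number=20
)
print(f'{same_values = }', f'{index_time = :.4f}s', f'{mask_time = :.4f}s', sep='\n')
```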
We can see that the Index offers large speed benefits for subsetting/slicing operations: in the timings above, the Index lookup took roughly 0.3 ms versus about 4 ms for the boolean mask. This is because the Index is fairly intelligent and knows some very important information about itself. This knowledge does come at a price, though, as the Index can become a little heavy, or much less effective, if it does not meet certain assumptions:
print(
f'{s1.index.is_monotonic_increasing = }',
f'{s1.index.is_monotonic_decreasing = }',
f'{s1.index.is_unique = }',
sep='\n'
)
s1.index.is_monotonic_increasing = True
s1.index.is_monotonic_decreasing = False
s1.index.is_unique = True
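To see why these assumptions matter, here is a small sketch with two made-up Series (sorted_s and shuffled_s are illustrative names) that hold the same data in a different order. On the monotonic Index, a label slice works even when the bounds are not labels in the Index; on the shuffled one, the same slice raises a KeyError until the Index is sorted.

```python
from pandas import Series

# hypothetical data: identical label/value pairs, different row order
sorted_s = Series([10, 30, 50, 70], index=[1, 3, 5, 7])
shuffled_s = Series([30, 10, 70, 50], index=[3, 1, 7, 5])

# monotonic Index: the slice bounds 2 and 6 need not exist as labels
in_range = sorted_s.loc[2:6]

# non-monotonic Index: pandas cannot locate bounds that are not labels
try:
    shuffled_s.loc[2:6]
    slice_failed = False
except KeyError:
    slice_failed = True

# sorting the Index restores the slicing behavior
recovered = shuffled_s.sort_index().loc[2:6]

print(
    f'{sorted_s.index.is_monotonic_increasing = }',
    f'{shuffled_s.index.is_monotonic_increasing = }',
    f'{slice_failed = }',
    sep='\n'
)
```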
Monotonicity and uniqueness are some of the information that pandas uses to
optimize Index-based operations. My favorite part is that the Index isn’t just
limited to hardware-backed datatypes. pandas has created abstractions to
meaningfully index many different types of tabular data:
Pandas Index Types
RangeIndex – an Index implementing a monotonic integer range
CategoricalIndex – an Index of Categoricals
MultiIndex – a multi-level, or hierarchical, Index
IntervalIndex – an Index of Intervals
DatetimeIndex – an Index of datetime64 data
TimedeltaIndex – an Index of timedelta64 data
PeriodIndex – an Index of Period data
NumericIndex – an Index of numpy int/uint/float data (removed in pandas 2.0, where a plain Index holds these dtypes)
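As a small taste of what these specialized types allow, the sketch below uses two of them with made-up data (the names daily, bins, and labels are illustrative). A DatetimeIndex supports partial-string selection of a whole month, and an IntervalIndex lets .loc find the interval that contains a given number.

```python
from pandas import IntervalIndex, Series, date_range

# hypothetical daily data covering 2023-01-01 through 2023-03-31
daily = Series(range(90), index=date_range('2023-01-01', periods=90, freq='D'))
february = daily.loc['2023-02']  # every row whose timestamp falls in February

# hypothetical bins: (0, 10], (10, 20], (20, 30], closed on the right by default
bins = IntervalIndex.from_breaks([0, 10, 20, 30])
labels = Series(['low', 'mid', 'high'], index=bins)
bucket = labels.loc[15]  # looks up the interval that contains 15

print(f'{len(february) = }', f'{bucket = }', sep='\n')
```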
Each of these Index types probably warrants a blog post on its own, so I’ll leave that for a future post!
Wrap Up
That’s all for this week. Thanks for tuning in to learn a little bit more about
the pandas.Index and the power it gives your analysis code. And, don’t forget: the most important lesson to learn when working with this tool is to always respect the Index!