Don’t Forget About the Index!#
This week, we have another question from StackOverflow. The question this week features a pandas problem that looks
tricky on the surface. However, it becomes quite straightforward once your remember to not forget about the .index
.
Specifically, in this problem, we had a data manipulation problem:
id time_stamp_date sth
1 12 col1
2 12 col1
3 12 col2
1 14 col2
1 14 col3
And, we needed to take that data to transform it into this:
col1 col2 col3 col4
id time_stamp_date
1 12 1 0 0 0
2 12 1 0 0 0
3 12 0 1 0 0
1 14 0 1 1 0
On the surface, it looks almost like a type of dummy coding problem that we
could solve with pandas.get_dummies
,
but, if we look closer, there is a very interesting aggregation problem here as well.
Specifically, we want to perform a get_dummies
-like operation that is grouped by
both 'id'
and 'time_stamp_date'
.
But as we can see, this doesn’t exactly get us there…
from pandas import DataFrame, get_dummies
df = DataFrame({
'id':[1,2,3,1, 1],
'time_stamp_date':['12','12', '12', '14', '14'],
'sth':['col1','col1', 'col2','col2', 'col3']
})
df.groupby(['id', 'time_stamp_date'], group_keys=True)['sth'].apply(get_dummies).fillna(0)
col1 | col2 | col3 | |||
---|---|---|---|---|---|
id | time_stamp_date | ||||
1 | 12 | 0 | True | 0 | 0 |
14 | 3 | 0 | True | False | |
4 | 0 | False | True | ||
2 | 12 | 1 | True | 0 | 0 |
3 | 12 | 2 | 0 | True | 0 |
The issue is that we are relying on the assumptions made by the get_dummies
function
to return to us a desired result, simply based on the fact that looks like what we want.
To work with pandas effectively, we need to break down our transformation into distinct
steps instead of relying on a single function to achieve our desired result. We could continue
playing with get_dummies
here, attempting to massage this output into what we want, but I have
found that it is often better to abandon the approach as soon as we realize it does
not work the way we want it to.
Even reconsider the .groupby
approach here: if we want to end up with an index that just has the
unique combinations of 'id'
and 'time_stamp_date'
, we do not necessarily need to aggregate. We can instead
think in terms of our index. Note that, for combinations of our index that exist, we want the value 1
to
appear. All other combinations should be represented by a ‘0’. This is the exact same transformation we
might make when moving from a dense-data format into a sparse-data format.
from pandas import Series, MultiIndex
s = Series(1, MultiIndex.from_frame(df))
s
id time_stamp_date sth
1 12 col1 1
2 12 col1 1
3 12 col2 1
1 14 col2 1
col3 1
dtype: int64
In the above pandas.Series
, we can see that we have coded the value ‘1’ for every
index that we have data for. In order to move 'sth'
into the columns, we can simply .unstack
this level.
sparse_s = s.unstack('sth', fill_value=0)
sparse_s
sth | col1 | col2 | col3 | |
---|---|---|---|---|
id | time_stamp_date | |||
1 | 12 | 1 | 0 | 0 |
14 | 0 | 1 | 1 | |
2 | 12 | 1 | 0 | 0 |
3 | 12 | 0 | 1 | 0 |
The problem becomes almost trivial because we can use the .index
effectively! From here, now we just have an ordering
problem to solve (or a .reindex
problem) to put the data in the order that the question wanted. To do this, we want to
preserve the row-index ordering from the original DataFrame
, and then we want our column-index ordering to represent that of
the sth_list
variable ['col1', 'col2', 'col3', 'col4']
.
sth_list = ['col1', 'col2', 'col3', 'col4']
out = sparse_s.reindex(
columns=sth_list,
index=df[['id', 'time_stamp_date']].drop_duplicates(),
fill_value=0
)
out
sth | col1 | col2 | col3 | col4 | |
---|---|---|---|---|---|
id | time_stamp_date | ||||
1 | 12 | 1 | 0 | 0 | 0 |
2 | 12 | 1 | 0 | 0 | 0 |
3 | 12 | 0 | 1 | 0 | 0 |
1 | 14 | 0 | 1 | 1 | 0 |
And there we have it: proof that thinking in terms of the .index
will not only reduce the total
lines of pandas code you can write, but also that it can speed up your DataFrame
manipulations.
Wrap-Up#
Thanks for checking out my blog post this week. I’ll talk to you all again soon—and, remember: don’t forget about the .index
!