Don’t Forget About the Index!#

This week, we have another question from StackOverflow. The question this week features a pandas problem that looks tricky on the surface. However, it becomes quite straightforward once your remember to not forget about the .index.

Specifically, in this problem, we had a data manipulation problem:

 id  time_stamp_date sth
 12              col1
 12              col1
 12              col2
 14              col2
 14              col3

And, we needed to take that data to transform it into this:

                    col1  col2  col3  col4
id time_stamp_date                        
1  12                  1     0     0     0
2  12                  1     0     0     0
3  12                  0     1     0     0
1  14                  0     1     1     0

On the surface, it looks almost like a type of dummy coding problem that we could solve with pandas.get_dummies, but, if we look closer, there is a very interesting aggregation problem here as well. Specifically, we want to perform a get_dummies-like operation that is grouped by both 'id' and 'time_stamp_date'.

But as we can see, this doesn’t exactly get us there…

from pandas import DataFrame, get_dummies

df = DataFrame({
    'id':[1,2,3,1, 1],
    'time_stamp_date':['12','12', '12', '14', '14'],
    'sth':['col1','col1', 'col2','col2', 'col3']
})

df.groupby(['id', 'time_stamp_date'], group_keys=True)['sth'].apply(get_dummies).fillna(0)

			col1	col2	col3
id	time_stamp_date
1	12	0	1.0	0.0	0.0
	14	3	0.0	1.0	0.0
	14	4	0.0	0.0	1.0
2	12	1	1.0	0.0	0.0
3	12	2	0.0	1.0	0.0

The issue is that we are relying on the assumptions made by the get_dummies function to return to us a desired result, simply based on the fact that looks like what we want.

To work with pandas effectively, we need to break down our transformation into distinct steps instead of relying on a single function to achieve our desired result. We could continue playing with get_dummies here, attempting to massage this output into what we want, but I have found that it is often better to abandon the approach as soon as we realize it does not work the way we want it to.

Even reconsider the .groupby approach here: if we want to end up with an index that just has the unique combinations of 'id' and 'time_stamp_date', we do not necessarily need to aggregate. We can instead think in terms of our index. Note that, for combinations of our index that exist, we want the value 1 to appear. All other combinations should be represented by a ‘0’. This is the exact same transformation we might make when moving from a dense-data format into a sparse-data format.

from pandas import Series, MultiIndex

s = Series(1, MultiIndex.from_frame(df))
s

id  time_stamp_date  sth 
1   12               col1    1
2   12               col1    1
3   12               col2    1
1   14               col2    1
                     col3    1
dtype: int64

In the above pandas.Series, we can see that we have coded the value ‘1’ for every index that we have data for. In order to move 'sth' into the columns, we can simply .unstack this level.

sparse_s = s.unstack('sth', fill_value=0)
sparse_s

	sth	col1	col2	col3
id	time_stamp_date
1	12	1	0	0
1	14	0	1	1
2	12	1	0	0
3	12	0	1	0

The problem becomes almost trivial because we can use the .index effectively! From here, now we just have an ordering problem to solve (or a .reindex problem) to put the data in the order that the question wanted. To do this, we want to preserve the row-index ordering from the original DataFrame, and then we want our column-index ordering to represent that of the sth_list variable ['col1', 'col2', 'col3', 'col4'].

sth_list = ['col1', 'col2', 'col3', 'col4']
out = sparse_s.reindex(
    columns=sth_list,
    index=df[['id', 'time_stamp_date']].drop_duplicates(),
    fill_value=0
)

out

	sth	col1	col2	col3	col4
id	time_stamp_date
1	12	1	0	0	0
2	12	1	0	0	0
3	12	0	1	0	0
1	14	0	1	1	0

And there we have it: proof that thinking in terms of the .index will not only reduce the total lines of pandas code you can write, but also that it can speed up your DataFrame manipulations.

Wrap-Up#

Thanks for checking out my blog post this week. I’ll talk to you all again soon—and, remember: don’t forget about the .index!

Why is DataFrame.corr() so much slower than numpy.corrcoef? Exploring Weather with Bokeh!

Recent Posts

Tags

Don’t Forget About the Index!#

Wrap-Up#