Table of Contents

pandas reshaping functions you didn't know about

In a previous blog post, Wide vs. Long Data in pandas — an Introduction, I introduced the topic of reshaping your tabular data from wide to long format using the index-aware .stack() & .unstack() or the data-oriented .melt() and .pivot(). However, pandas has a couple of convenient reshaping methods that you should know about that simplify your life more than those aforementioned methods.

Today, we'll cover...

pivot_table

The pandas .pivot_table enables us to easily transform our data from a long to wide format while also running an aggregation. From the following data, we might want to find each of the average value in sales for each product across ALL stores. We then want to pivot those averages such that we end up with a row-index of product and a column-index of months where each value corresponds to the average sales for that product in that month.

from pandas import DataFrame, MultiIndex, CategoricalDtype
from calendar import month_abbr
from numpy.random import default_rng

MonthDtype = CategoricalDtype([m for m in month_abbr[1:]], ordered=True)
rng = default_rng(0)

index = MultiIndex.from_product(
    [[*'ABCD'], ['XYZ'], month_abbr[1:5]], names=['product', 'store', 'month']
)

df = (
    DataFrame(index=index, data={'sales': rng.normal(10, 3, size=len(index))})
    .reset_index()
    .astype({'month': MonthDtype})
)

df.head(6)
product store month sales
0 A XYZ Jan 10.377191
1 A XYZ Feb 9.603685
2 A XYZ Mar 11.921268
3 A XYZ Apr 10.314700
4 B XYZ Jan 8.392992
5 B XYZ Feb 11.084785

Without using .pivot_table, we would reach for a .groupby()….unstack() operation. We can use our understanding of the index that is formed as a result of a groupby operation to then reshape our values as needed.

(
    df.groupby(['product', 'month'], observed=True)['sales']
    .mean()
    .unstack()
)
month Jan Feb Mar Apr
product
A 10.377191 9.603685 11.921268 10.314700
B 8.392992 11.084785 13.912000 12.841243
C 7.888794 6.203736 8.130177 10.123978
D 3.024908 9.343625 6.262267 7.803198

Considering this transformation and display is so common, pandas has a convenience method for it: .pivot_table.

df.pivot_table(
    index='product', # resultant row-index
    columns='month', # resultant column-index
    values='sales',  # what values to aggregate
    aggfunc='mean',  # how to aggregate
    observed=True
)
month Jan Feb Mar Apr
product
A 10.377191 9.603685 11.921268 10.314700
B 8.392992 11.084785 13.912000 12.841243
C 7.888794 6.203736 8.130177 10.123978
D 3.024908 9.343625 6.262267 7.803198

The .pivot_table is fairly straightforward and is useful for some simple conditions. I find the previously seen .groupby()….unstack() approach to be far more flexible though, so if you have to perform any complex aggregation or need to view differently aggregated values you should probably stick with this pattern!

lreshape

pandas.lreshape is one of the most overlooked reshaping functions that pandas offers. In fact it was marked for deprecation for quite some time before reversing that decision! This function lets you take multiple

from pandas import DataFrame

df = DataFrame({
    'product': [*'ABCD'],
    'jan_sales1'  : [0, 11, 12, 13],
    'jan_sales2'  : [0,  1,  2,  3],
    'feb_sales1'  : [14, 15, 16, 17],
    'feb_sales2'  : [ 4,  5,  6,  7],
    'mar_sales1'  : [18, 19, 0, 0],
    'mar_sales2'  : [8 ,  9, 0, 0],
})

df
product jan_sales1 jan_sales2 feb_sales1 feb_sales2 mar_sales1 mar_sales2
0 A 0 0 14 4 18 8
1 B 11 1 15 5 19 9
2 C 12 2 16 6 0 0
3 D 13 3 17 7 0 0
from pandas import lreshape

groups = {
    'jan': ['jan_sales1', 'jan_sales2'],
    'feb': ['feb_sales1', 'feb_sales2'],
    'mar': ['mar_sales1', 'mar_sales2']
}
lreshape(df, groups)
product jan feb mar
0 A 0 14 18
1 B 11 15 19
2 C 12 16 0
3 D 13 17 0
4 A 0 4 8
5 B 1 5 9
6 C 2 6 0
7 D 3 7 0

This is very useful if you need to perform multiple melt-like operations all at once.

wide_to_long

This is similar to lreshape, but instead of supplying generic groupings in a dictionary format, we can use regular expression patterns to define what groups need to be combined together. Using this function is fairly straightforward: you provide a pattern to identify related columns, and it stacks them together while keeping your identifiers intact. This is especially useful for time series data or repeated measurements, where consistent formatting is key.

from pandas import wide_to_long

wide_to_long(
    df, 
    stubnames=['jan', 'feb', 'mar'], # column prefixes to search for
    sep='_sales', # what separates the column prefix (stubname) from the suffix
    suffix=r'\d+',  # extract the last digit to act as an identifier
    i='product',  # column that acts as index
    j='time',     # stacked identifier output name (the part that suffix matches)
)
jan feb mar
product time
A 1 0 14 18
B 1 11 15 19
C 1 12 16 0
D 1 13 17 0
A 2 0 4 8
B 2 1 5 9
C 2 2 6 0
D 2 3 7 0

The trick to using the above function is to understand what each of the arguments lets you manipulate. But I would also recommend that this reshaping doesn't necessarily change the data into a fully long format. If we want this to be truly long, we will need to do an additional stack- or reach for our more generic pandas syntax.

As I mentioned earlier, if you can think in terms of the index, you can express these types of operations with a large degree of flexibility.

(
    df.set_index('product')
    .stack().rename('sales')
    .reset_index()
    .pipe(
        lambda d: 
            d['level_1'].str.extract(r'(?P<month>\w+)_sales(?P<time>\d+)')
            .astype({'time': int})
            .join(d)
    )
    .drop(columns=['level_1'])
).head(6)
month time product sales
0 jan 1 A 0
1 jan 2 A 0
2 feb 1 A 14
3 feb 2 A 4
4 mar 1 A 18
5 mar 2 A 8

Wrap-Up

And there you have it! My favorite overlooked reshaping methods in pandas. While not much will beat out our .stack() and .unstack() operations in terms of flexibility, these tools help us express complex transformations with some easy-to-understand features that abstract away a lot of the mechanical details.

What do you think about these pandas functions? Will you use them the same now? Let me know on the DUTC Discord server.

Table of Contents
Table of Contents