pandas reshaping functions you didn't know about
In a previous blog post, Wide vs. Long Data in pandas — an Introduction, I introduced the topic of reshaping your tabular data from wide to long format using the index-aware .stack() & .unstack() or the data-oriented .melt() and .pivot(). However, pandas has a couple of convenient reshaping methods that you should know about that simplify your life more than those aforementioned methods.
Today, we'll cover...
pandas.pivot_table a simplification for transforming long → wide + aggregation.
pandas.lreshape a generic wide → long transform.
pandas.wide_to_long a patterened wide → long transform.
pivot_table
The pandas .pivot_table enables us to easily transform our data from a long to wide format while also running an aggregation. From the following data, we might want to find each of the average value in sales for each product across ALL stores. We then want to pivot those averages such that we end up with a row-index of product and a column-index of months where each value corresponds to the average sales for that product in that month.
from pandas import DataFrame, MultiIndex, CategoricalDtype
from calendar import month_abbr
from numpy.random import default_rng
MonthDtype = CategoricalDtype([m for m in month_abbr[1:]], ordered=True)
rng = default_rng(0)
index = MultiIndex.from_product(
[[*'ABCD'], ['XYZ'], month_abbr[1:5]], names=['product', 'store', 'month']
)
df = (
DataFrame(index=index, data={'sales': rng.normal(10, 3, size=len(index))})
.reset_index()
.astype({'month': MonthDtype})
)
df.head(6)
product | store | month | sales | |
---|---|---|---|---|
0 | A | XYZ | Jan | 10.377191 |
1 | A | XYZ | Feb | 9.603685 |
2 | A | XYZ | Mar | 11.921268 |
3 | A | XYZ | Apr | 10.314700 |
4 | B | XYZ | Jan | 8.392992 |
5 | B | XYZ | Feb | 11.084785 |
Without using .pivot_table, we would reach for a .groupby()….unstack() operation. We can use our understanding of the index that is formed as a result of a groupby operation to then reshape our values as needed.
(
df.groupby(['product', 'month'], observed=True)['sales']
.mean()
.unstack()
)
month | Jan | Feb | Mar | Apr |
---|---|---|---|---|
product | ||||
A | 10.377191 | 9.603685 | 11.921268 | 10.314700 |
B | 8.392992 | 11.084785 | 13.912000 | 12.841243 |
C | 7.888794 | 6.203736 | 8.130177 | 10.123978 |
D | 3.024908 | 9.343625 | 6.262267 | 7.803198 |
Considering this transformation and display is so common, pandas has a convenience method for it: .pivot_table.
df.pivot_table(
index='product', # resultant row-index
columns='month', # resultant column-index
values='sales', # what values to aggregate
aggfunc='mean', # how to aggregate
observed=True
)
month | Jan | Feb | Mar | Apr |
---|---|---|---|---|
product | ||||
A | 10.377191 | 9.603685 | 11.921268 | 10.314700 |
B | 8.392992 | 11.084785 | 13.912000 | 12.841243 |
C | 7.888794 | 6.203736 | 8.130177 | 10.123978 |
D | 3.024908 | 9.343625 | 6.262267 | 7.803198 |
The .pivot_table is fairly straightforward and is useful for some simple conditions. I find the previously seen .groupby()….unstack() approach to be far more flexible though, so if you have to perform any complex aggregation or need to view differently aggregated values you should probably stick with this pattern!
lreshape
pandas.lreshape is one of the most overlooked reshaping functions that pandas offers. In fact it was marked for deprecation for quite some time before reversing that decision! This function lets you take multiple
from pandas import DataFrame
df = DataFrame({
'product': [*'ABCD'],
'jan_sales1' : [0, 11, 12, 13],
'jan_sales2' : [0, 1, 2, 3],
'feb_sales1' : [14, 15, 16, 17],
'feb_sales2' : [ 4, 5, 6, 7],
'mar_sales1' : [18, 19, 0, 0],
'mar_sales2' : [8 , 9, 0, 0],
})
df
product | jan_sales1 | jan_sales2 | feb_sales1 | feb_sales2 | mar_sales1 | mar_sales2 | |
---|---|---|---|---|---|---|---|
0 | A | 0 | 0 | 14 | 4 | 18 | 8 |
1 | B | 11 | 1 | 15 | 5 | 19 | 9 |
2 | C | 12 | 2 | 16 | 6 | 0 | 0 |
3 | D | 13 | 3 | 17 | 7 | 0 | 0 |
from pandas import lreshape
groups = {
'jan': ['jan_sales1', 'jan_sales2'],
'feb': ['feb_sales1', 'feb_sales2'],
'mar': ['mar_sales1', 'mar_sales2']
}
lreshape(df, groups)
product | jan | feb | mar | |
---|---|---|---|---|
0 | A | 0 | 14 | 18 |
1 | B | 11 | 15 | 19 |
2 | C | 12 | 16 | 0 |
3 | D | 13 | 17 | 0 |
4 | A | 0 | 4 | 8 |
5 | B | 1 | 5 | 9 |
6 | C | 2 | 6 | 0 |
7 | D | 3 | 7 | 0 |
This is very useful if you need to perform multiple melt-like operations all at once.
wide_to_long
This is similar to lreshape, but instead of supplying generic groupings in a dictionary format, we can use regular expression patterns to define what groups need to be combined together. Using this function is fairly straightforward: you provide a pattern to identify related columns, and it stacks them together while keeping your identifiers intact. This is especially useful for time series data or repeated measurements, where consistent formatting is key.
from pandas import wide_to_long
wide_to_long(
df,
stubnames=['jan', 'feb', 'mar'], # column prefixes to search for
sep='_sales', # what separates the column prefix (stubname) from the suffix
suffix=r'\d+', # extract the last digit to act as an identifier
i='product', # column that acts as index
j='time', # stacked identifier output name (the part that suffix matches)
)
jan | feb | mar | ||
---|---|---|---|---|
product | time | |||
A | 1 | 0 | 14 | 18 |
B | 1 | 11 | 15 | 19 |
C | 1 | 12 | 16 | 0 |
D | 1 | 13 | 17 | 0 |
A | 2 | 0 | 4 | 8 |
B | 2 | 1 | 5 | 9 |
C | 2 | 2 | 6 | 0 |
D | 2 | 3 | 7 | 0 |
The trick to using the above function is to understand what each of the arguments lets you manipulate. But I would also recommend that this reshaping doesn't necessarily change the data into a fully long format. If we want this to be truly long, we will need to do an additional stack- or reach for our more generic pandas syntax.
As I mentioned earlier, if you can think in terms of the index, you can express these types of operations with a large degree of flexibility.
(
df.set_index('product')
.stack().rename('sales')
.reset_index()
.pipe(
lambda d:
d['level_1'].str.extract(r'(?P<month>\w+)_sales(?P<time>\d+)')
.astype({'time': int})
.join(d)
)
.drop(columns=['level_1'])
).head(6)
month | time | product | sales | |
---|---|---|---|---|
0 | jan | 1 | A | 0 |
1 | jan | 2 | A | 0 |
2 | feb | 1 | A | 14 |
3 | feb | 2 | A | 4 |
4 | mar | 1 | A | 18 |
5 | mar | 2 | A | 8 |
Wrap-Up
And there you have it! My favorite overlooked reshaping methods in pandas. While not much will beat out our .stack() and .unstack() operations in terms of flexibility, these tools help us express complex transformations with some easy-to-understand features that abstract away a lot of the mechanical details.
What do you think about these pandas functions? Will you use them the same now? Let me know on the DUTC Discord server.