Wide vs. Long Data in pandas — an Introduction#
Tabular data is stored in a variety of formats, shapes, and sizes, depending on the needs of the analysis or the limitations of the tools at hand. Two common structures for tabular data are wide and long formats. While both represent the same underlying information, their differences lie in how variables and observations are organized, and choosing the right format can make or break your workflow.
In wide format, data is spread across columns, with each variable (or time point, category, etc.) occupying its own column. This structure is often easier for human readability when comparing across categories (you likely organize your Excel spreadsheets in this format) but can be limiting for advanced data processing and visualization.
In contrast, long format arranges data so that each observation has its own row, with a single column used to categorize variables. This “tidy” structure is the preferred choice for most modern analytical tools, especially when grouping, filtering, or reshaping data.
Understanding when to use wide vs. long format—and how to efficiently transition between the two is an essential skill for any data analyst. So let’s dive into some pandas code to highlight the key differences, practical examples, and scenarios where each format shines.
Wide Data#
from pandas import DataFrame, NA
df = DataFrame({
'product': [*'ABCD'],
'jan' : [NA, 11, 12, 13],
'feb' : [14, 15, 16, 17],
'mar' : [18, 19, NA, NA],
}).convert_dtypes()
df
product | jan | feb | mar | |
---|---|---|---|---|
0 | A | <NA> | 14 | 18 |
1 | B | 11 | 15 | 19 |
2 | C | 12 | 16 | <NA> |
3 | D | 13 | 17 | <NA> |
We know that the table above is wide because its…
Rows represent entities: Each row corresponds to a specific product (e.g., A, B, C, D).
Columns represent levels of a singular or multiple (nested) variables:
The product column identifies the product being tracked.
The remaining columns (jan, feb, and mar) represent sales for specific months. These “variables” all belong to a superset of “time,” meaning that these columns are simply different levels of a broader category.
Values span multiple columns: The sales data for different months are stored in separate columns, making it easier to see all months’ sales for a product at a glance.
The benefit of wide-formatted data is that it is typically straightforward to glance insight from the data by simply using your eyes. I can easily compare the sales from adjacent months within any given product. While this feels like an okay way to model this data, we often see nesting of these attributes. What if each product doesn’t just have sales for each month, but also tracks the number of units sold, along with the number of clicks each product received on a website? Furthermore, what if I obtain more data from June, July, and August? Either of these extensions would force us to add more columns to our table, which causes it to grow wide.
from pandas import DataFrame
df = DataFrame({
'product': [*'ABCD'],
'jan_sales' : [NA, 11, 12, 13],
'jan_units' : [NA, 1, 2, 3],
'jan_clicks' : [NA, 71, 72, 73],
'feb_sales' : [14, 15, 16, 17],
'feb_units' : [ 4, 5, 6, 7],
'feb_clicks' : [74, 75, 6, 77],
'mar_sales' : [18, 19, NA, NA],
'mar_units' : [8 , 9, NA, NA],
'mar_clicks' : [78, 79, NA, NA],
})
df
product | jan_sales | jan_units | jan_clicks | feb_sales | feb_units | feb_clicks | mar_sales | mar_units | mar_clicks | |
---|---|---|---|---|---|---|---|---|---|---|
0 | A | <NA> | <NA> | <NA> | 14 | 4 | 74 | 18 | 8 | 78 |
1 | B | 11 | 1 | 71 | 15 | 5 | 75 | 19 | 9 | 79 |
2 | C | 12 | 2 | 72 | 16 | 6 | 6 | <NA> | <NA> | <NA> |
3 | D | 13 | 3 | 73 | 17 | 7 | 77 | <NA> | <NA> | <NA> |
Long Data#
from pandas import DataFrame, NA
from pandas import CategoricalDtype
from calendar import month_abbr
MonthDtype = CategoricalDtype([m.lower() for m in month_abbr[1:]], ordered=True)
df = (
DataFrame({
'product': [*'AABBBCCDD'],
'month' : [*'fmjfmjfjf'],
'sales' : [*range(11, 11+9)],
})
.assign( # replace 'j' → 'jan', etc.
month=lambda d:
d['month'].map({'j': 'jan', 'f': 'feb', 'm': 'mar'})
)
.astype({'month': MonthDtype})
.sort_values(['month', 'product'])
)
df
product | month | sales | |
---|---|---|---|
2 | B | jan | 13 |
5 | C | jan | 16 |
7 | D | jan | 18 |
0 | A | feb | 11 |
3 | B | feb | 14 |
6 | C | feb | 17 |
8 | D | feb | 19 |
1 | A | mar | 12 |
4 | B | mar | 15 |
We know these data are long because:
Rows represent observations: Each row corresponds to a single observation, which in this case is the sales figure for a specific product in a specific month.
Columns define unique variables (not levels of a singular variable):
The product column identifies the product being tracked (e.g., A, B, C, D).
The month column specifies the time period of the observation (jan, feb, and mar).
The sales column provides the measured value for the observation (e.g., sales amount).
If we need to add a new attribute (like units or clicks), we would create new columns. However, if we need to add more observations, we add rows; the addition of new rows for new observations is what helps us define this table as long.
While long format is considered more “analytically ready” and wide format is regarded as more “report ready,” it is important to know how to transform your data between the two shapes.
Reshaping#
Often we are not in control of data generative processes. Data can come to us in either wide or long format, with different column names, or with inconsistent spacing, and the cleaning process will often cause more headache than the actual analysis.
To best differentiate the formats, I will use the below code to highlight given
portions of my pandas.DataFrames
. Understanding this code is not necessary,
but I left it here if you ever wanted to reference it.
def highlight_index(index, mapping={}):
return index.map({
k: f'background-color: {v}; color: black;'
for k, v in mapping.items()
}).fillna('')
def highlight_columns(df, mapping):
def _highlight_col(s):
color = mapping.get(s.name, '')
return [f"background-color: {color}; color: black"] * len(df)
return (
df.style.apply(_highlight_col, subset=[*mapping.keys()])
.apply_index(highlight_index, axis=1, mapping=mapping)
)
def highlight_rows(df, mapping, subset=[]):
result = None
mapping = {
k: f'background-color: {v}; color: black'
for k, v in mapping.items()
}
def _highlight_col(s):
nonlocal result
if result is not None:
return result
result = s.map(mapping).fillna('')
return result
return (
df.style.apply(_highlight_col, subset=subset)
)
melt (wide to long)#
Given our original data, how do we transform it from this wide format into a long
one? In pandas, we pandas.DataFrame.melt
which is also known as an unpivot
in some other tabular data analysis tools.
The term “melt” originates from its metaphorical resemblance to melting a solid object into a liquid, where the rigid, structured format (wide data with fixed columns) is “melted” into a more fluid and flexible structure (long data with fewer, generalized columns). The melt
method has us specify which current columns
are to be used as identifiers id_vars
, what to name the newly melted column var_name
,
and what to name the melted values value_name
.
from pandas import DataFrame
df = DataFrame({
'product': [*'ABCD'],
'jan' : [NA, 11, 12, 13],
'feb' : [14, 15, 16, 17],
'mar' : [18, 19, NA, NA],
}).convert_dtypes()
color_mapping = {'jan': '#66D9EF', 'feb': '#FF6F61', 'mar': '#FFF7E5'}
display(
df.pipe(highlight_columns, mapping=color_mapping),
df.melt(id_vars='product', var_name='month', value_name='sales')
.pipe(highlight_rows, mapping=color_mapping, subset=['month', 'sales'])
)
product | jan | feb | mar | |
---|---|---|---|---|
0 | A | 14 | 18 | |
1 | B | 11 | 15 | 19 |
2 | C | 12 | 16 | |
3 | D | 13 | 17 |
product | month | sales | |
---|---|---|---|
0 | A | jan | |
1 | B | jan | 11 |
2 | C | jan | 12 |
3 | D | jan | 13 |
4 | A | feb | 14 |
5 | B | feb | 15 |
6 | C | feb | 16 |
7 | D | feb | 17 |
8 | A | mar | 18 |
9 | B | mar | 19 |
10 | C | mar | |
11 | D | mar |
Notice how, in our long format, we have transformed our column names, 'jan', 'feb', 'mar'
,
into their own singular column titled 'month'
. At the same time, we took the values
from those original columns and inserted them into a new column titled 'sales'
.
This format provides ample metadata so that we know what each value represents and corresponds to without needing additional guidance (e.g., if you received the wide-format data, your first question would be, “What do these numbers represent?”).
pivot (long to wide)#
The inverse transform is commonly called a pivot
and is accomplished with pandas.DataFrame.pivot.
The pandas.DataFrame.pivot
function reshapes data from long to wide format by
reorganizing rows into columns based on unique combinations of specified keys.
Key arguments include index (column(s) to use as the new row index), columns
(column to use for creating new column labels), and values
(column to populate the new table with).
from pandas import DataFrame, NA
from pandas import CategoricalDtype
from calendar import month_abbr
MonthDtype = CategoricalDtype([m.lower() for m in month_abbr[1:]], ordered=True)
df = (
DataFrame({
'product': [*'AABBBCCDD'],
'month' : [*'fmjfmjfjf'],
'sales' : [*range(11, 11+9)],
})
.assign( # replace 'j' → 'jan', etc.
month=lambda d: d['month'].map({'j': 'jan', 'f': 'feb', 'm': 'mar'})
)
.astype({'month': MonthDtype, 'sales': 'Int64'})
.sort_values(['month', 'product'])
.reset_index(drop=True)
)
display(
df.pipe(highlight_rows, mapping=color_mapping, subset=['month', 'sales']),
df.pivot(index='product', columns='month', values='sales')
.rename_axis(columns=None).reset_index()
.pipe(highlight_columns, mapping=color_mapping)
)
product | month | sales | |
---|---|---|---|
0 | B | jan | 13 |
1 | C | jan | 16 |
2 | D | jan | 18 |
3 | A | feb | 11 |
4 | B | feb | 14 |
5 | C | feb | 17 |
6 | D | feb | 19 |
7 | A | mar | 12 |
8 | B | mar | 15 |
product | jan | feb | mar | |
---|---|---|---|---|
0 | A | 11 | 12 | |
1 | B | 13 | 14 | 15 |
2 | C | 16 | 17 | |
3 | D | 18 | 19 |
Wrap-Up#
And that’s all the time we have this week! There is a lot more to discuss on this topic, so I hope that this introduction can help you navigate your own data formats and select the right one for your use case. Keep an eye out for future blog posts where I’ll discuss more advanced reshaping techniques in pandas!
What do you think about long and wide data formats? Do you encounter them in your work? Let me know on the DUTC Discord server.
Talk to you all next week!