pandas.concat, explained.#
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tackle
a pandas question I received concerning the different ways to combine pandas.DataFrames
.
Today, I’ll focus on pandas.concat
, since we have
covered DataFrame merges quite thoroughly in previous weeks. Specifically, we’ll take a look at
DataFrame Inequality Joins, DataFrame Joins & Sets and DataFrame Joins & MultiSets.
In pandas, we have three explicit ways to combine DataFrames:
pandas.merge
joins two DataFrames horizontally on a specific column (or optionally, the index).pandas.join
joins two DataFrames horizontally on their indexes.pandas.concat
joins many DataFrames either horizontally or vertically according to their row/column index.
The thing that sets pandas.concat
apart from its competition is the ability
to combine two or more DataFrames, whereas pandas.merge
—and by extension pandas.join
(since it is just a wrapper around pandas.merge
)—can only handle aligning two DataFrames.
Let’s take a look at the flexibility we have when working with this function.
The Data#
Let’s create three DataFrames that share some row-index and column-index values. Do be careful to note where these row/column-index values are equivalent. I have spaced out the code so that you can easily see where these entities will align.
from numpy import arange
import pandas as pd
data = arange(0, 18).reshape(3, 3, 2)
# take note where these indexes and columns overlap and where they do not
dfs = {
'df1': pd.DataFrame(data[0], index=['a', 'b', 'c' ], columns=['x', 'y' ]),
'df2': pd.DataFrame(data[1], index=[ 'b', 'c', 'd' ], columns=['x', 'z']),
'df3': pd.DataFrame(data[2], index=[ 'b', 'd', 'e'], columns=['x', 'z']),
}
for k, df in dfs.items():
display(k, df)
'df1'
x | y | |
---|---|---|
a | 0 | 1 |
b | 2 | 3 |
c | 4 | 5 |
'df2'
x | z | |
---|---|---|
b | 6 | 7 |
c | 8 | 9 |
d | 10 | 11 |
'df3'
x | z | |
---|---|---|
b | 12 | 13 |
d | 14 | 15 |
e | 16 | 17 |
Given the above three frames, let’s explore the major ways we can combine them using
pandas.concat
.
Concat Options#
When combining multiple DataFrames, we have two decisions to make:
Should we combine horizontally or vertically?
Should we take the intersection (inner) or the union (outer) of the index along that direction?
These two questions are controlled by the axis=
and join=
arguments in pandas.concat
.
Specifying axis='rows'
means that we will stack the DataFrames vertically. Since
we are vertically stacking our DataFrames, we will assume that we may only want
to pull a subset of the columns. If we want our result to contain ALL of the columns
across all of the inputted DataFrames, we can specify join='outer'
. On the other hand,
if we only want columns that are shared across all of the DataFrames (i.e., the ones
in common), then we would use join='inner'
.
Remember that axis=
indicates the direction we are stacking the DataFrames, and
join=
controls what rows/columns we consider while performing this stacking operation.
When axis='rows'
, the join=
exerts control on the resultant columns. Whereas
if axis='columns'
(stack horizontally), then join=
exerts control on the
resultant rows.
Let’s take a look at this in action:
# align vertically, preserving fully overlapping columns indices
pd.concat(dfs, axis='rows', join='outer')
x | y | z | ||
---|---|---|---|---|
df1 | a | 0 | 1.0 | NaN |
b | 2 | 3.0 | NaN | |
c | 4 | 5.0 | NaN | |
df2 | b | 6 | NaN | 7.0 |
c | 8 | NaN | 9.0 | |
d | 10 | NaN | 11.0 | |
df3 | b | 12 | NaN | 13.0 |
d | 14 | NaN | 15.0 | |
e | 16 | NaN | 17.0 |
# align vertically, preserving ONLY shared columns indices
pd.concat(dfs, axis='rows', join='inner')
x | ||
---|---|---|
df1 | a | 0 |
b | 2 | |
c | 4 | |
df2 | b | 6 |
c | 8 | |
d | 10 | |
df3 | b | 12 |
d | 14 | |
e | 16 |
# align horizontally, preserving all row indices
pd.concat(dfs, axis='columns', join='outer')
df1 | df2 | df3 | ||||
---|---|---|---|---|---|---|
x | y | x | z | x | z | |
a | 0.0 | 1.0 | NaN | NaN | NaN | NaN |
b | 2.0 | 3.0 | 6.0 | 7.0 | 12.0 | 13.0 |
c | 4.0 | 5.0 | 8.0 | 9.0 | NaN | NaN |
d | NaN | NaN | 10.0 | 11.0 | 14.0 | 15.0 |
e | NaN | NaN | NaN | NaN | 16.0 | 17.0 |
# align horizontally, preserving ONLY shared row indices
pd.concat(dfs, axis='columns', join='inner')
df1 | df2 | df3 | ||||
---|---|---|---|---|---|---|
x | y | x | z | x | z | |
b | 2 | 3 | 6 | 7 | 12 | 13 |
It’s a pretty flexible tool, right? Now that we know what types of problems
pandas.concat
solves for us, let’s talk about how this operation is carried
out.
The Mechanics#
pandas operations are index-aware, meaning that most operations will inspect the index that has been set (on either the rows, columns, or both) and operate with the knowledge that is contained within the index.
Concatenation is not an exception to this rule. In fact, we can recreate the behavior of pandas.concat
through the DataFrame.reindex
method.
All we need to do is create the shared index (e.g., the inner [intersection] or outer [union]
of all indexes in consideration) and align our frames to that!
Let’s create some pandas.Series
objects to concatenate, again highlighting where
their .index
line up with one another.
from functools import reduce
import pandas as pd
parts = [
# highlighting the alignments
# a b c d e
pd.Series([0, 1, 2], index=['a', 'b', 'c' ], dtype='Int64'),
pd.Series([3, 4, 5], index=[ 'b', 'c', 'd' ], dtype='Int64'),
pd.Series([6, 7, 8], index=[ 'b', 'd', 'e'], dtype='Int64'),
]
indices = [s.index for s in parts]
If we take the union/intersection of each of these index objects,
noting that the outer
variable contains an index whose values are the union of
ALL values, and the inner
variable only contains values that are the intersection
of the index values…
display(
(outer := reduce(lambda l, r: l.union(r), indices)), # outer
(inner := reduce(lambda l, r: l.intersection(r), indices)), # inner
)
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Index(['b'], dtype='object')
…then we can use these new index objects to .reindex
on each of our Series objects
to combine them into a single output!
# outer aligned
pd.DataFrame({i: p.reindex(outer) for i, p in enumerate(parts)})
0 | 1 | 2 | |
---|---|---|---|
a | 0 | <NA> | <NA> |
b | 1 | 3 | 6 |
c | 2 | 4 | <NA> |
d | <NA> | 5 | 7 |
e | <NA> | <NA> | 8 |
# inner aligned
pd.DataFrame({i: p.reindex(inner) for i, p in enumerate(parts)})
0 | 1 | 2 | |
---|---|---|---|
b | 1 | 3 | 6 |
And that’s our conceptual recreation of pandas.concat
just using our knowledge
of how the index works!
Wrap-Up#
That’s all we have time for this week! What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.
Make sure you tune back in next week for a fresh post on Python, pandas, Matplotlib, and more. Talk to you all then.