pandas.concat, explained.#

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tackle a pandas question I received concerning the different ways to combine pandas.DataFrames. Today, I’ll focus on pandas.concat, since we have covered DataFrame merges quite thoroughly in previous weeks. Specifically, we’ll take a look at DataFrame Inequality Joins, DataFrame Joins & Sets and DataFrame Joins & MultiSets.

In pandas, we have three explicit ways to combine DataFrames:

  • pandas.merge joins two DataFrames horizontally on a specific column (or optionally, the index).

  • pandas.join joins two DataFrames horizontally on their indexes.

  • pandas.concat joins many DataFrames either horizontally or vertically according to their row/column index.

The thing that sets pandas.concat apart from its competition is the ability to combine two or more DataFrames, whereas pandas.merge—and by extension pandas.join (since it is just a wrapper around pandas.merge)—can only handle aligning two DataFrames.

Let’s take a look at the flexibility we have when working with this function.

The Data#

Let’s create three DataFrames that share some row-index and column-index values. Do be careful to note where these row/column-index values are equivalent. I have spaced out the code so that you can easily see where these entities will align.

from numpy import arange
import pandas as pd

data = arange(0, 18).reshape(3, 3, 2)

# take note where these indexes and columns overlap and where they do not
dfs = {
    'df1': pd.DataFrame(data[0], index=['a', 'b', 'c'          ], columns=['x', 'y'     ]),
    'df2': pd.DataFrame(data[1], index=[     'b', 'c', 'd'     ], columns=['x',      'z']),
    'df3': pd.DataFrame(data[2], index=[     'b',      'd', 'e'], columns=['x',      'z']),
}

for k, df in dfs.items():
    display(k, df)
'df1'
x y
a 0 1
b 2 3
c 4 5
'df2'
x z
b 6 7
c 8 9
d 10 11
'df3'
x z
b 12 13
d 14 15
e 16 17

Given the above three frames, let’s explore the major ways we can combine them using pandas.concat.

Concat Options#

When combining multiple DataFrames, we have two decisions to make:

  • Should we combine horizontally or vertically?

  • Should we take the intersection (inner) or the union (outer) of the index along that direction?

These two questions are controlled by the axis= and join= arguments in pandas.concat. Specifying axis='rows' means that we will stack the DataFrames vertically. Since we are vertically stacking our DataFrames, we will assume that we may only want to pull a subset of the columns. If we want our result to contain ALL of the columns across all of the inputted DataFrames, we can specify join='outer'. On the other hand, if we only want columns that are shared across all of the DataFrames (i.e., the ones in common), then we would use join='inner'.

Remember that axis= indicates the direction we are stacking the DataFrames, and join= controls what rows/columns we consider while performing this stacking operation. When axis='rows', the join= exerts control on the resultant columns. Whereas if axis='columns' (stack horizontally), then join= exerts control on the resultant rows.

Let’s take a look at this in action:

# align vertically, preserving fully overlapping columns indices
pd.concat(dfs, axis='rows', join='outer')
x y z
df1 a 0 1.0 NaN
b 2 3.0 NaN
c 4 5.0 NaN
df2 b 6 NaN 7.0
c 8 NaN 9.0
d 10 NaN 11.0
df3 b 12 NaN 13.0
d 14 NaN 15.0
e 16 NaN 17.0
# align vertically, preserving ONLY shared columns indices
pd.concat(dfs, axis='rows', join='inner')
x
df1 a 0
b 2
c 4
df2 b 6
c 8
d 10
df3 b 12
d 14
e 16
# align horizontally, preserving all row indices
pd.concat(dfs, axis='columns', join='outer')
df1 df2 df3
x y x z x z
a 0.0 1.0 NaN NaN NaN NaN
b 2.0 3.0 6.0 7.0 12.0 13.0
c 4.0 5.0 8.0 9.0 NaN NaN
d NaN NaN 10.0 11.0 14.0 15.0
e NaN NaN NaN NaN 16.0 17.0
# align horizontally, preserving ONLY shared row indices
pd.concat(dfs, axis='columns', join='inner')
df1 df2 df3
x y x z x z
b 2 3 6 7 12 13

It’s a pretty flexible tool, right? Now that we know what types of problems pandas.concat solves for us, let’s talk about how this operation is carried out.

The Mechanics#

pandas operations are index-aware, meaning that most operations will inspect the index that has been set (on either the rows, columns, or both) and operate with the knowledge that is contained within the index.

Concatenation is not an exception to this rule. In fact, we can recreate the behavior of pandas.concat through the DataFrame.reindex method. All we need to do is create the shared index (e.g., the inner [intersection] or outer [union] of all indexes in consideration) and align our frames to that!

Let’s create some pandas.Series objects to concatenate, again highlighting where their .index line up with one another.

from functools import reduce
import pandas as pd

parts = [
    #                    highlighting the alignments
    #                            a    b    c    d    e
    pd.Series([0, 1, 2], index=['a', 'b', 'c'          ], dtype='Int64'),
    pd.Series([3, 4, 5], index=[     'b', 'c', 'd'     ], dtype='Int64'),
    pd.Series([6, 7, 8], index=[     'b',      'd', 'e'], dtype='Int64'),
]
indices = [s.index for s in parts]

If we take the union/intersection of each of these index objects, noting that the outer variable contains an index whose values are the union of ALL values, and the inner variable only contains values that are the intersection of the index values…

display(
    (outer := reduce(lambda l, r: l.union(r),        indices)), # outer
    (inner := reduce(lambda l, r: l.intersection(r), indices)), # inner
)
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Index(['b'], dtype='object')

…then we can use these new index objects to .reindex on each of our Series objects to combine them into a single output!

# outer aligned
pd.DataFrame({i: p.reindex(outer) for i, p in enumerate(parts)})
0 1 2
a 0 <NA> <NA>
b 1 3 6
c 2 4 <NA>
d <NA> 5 7
e <NA> <NA> 8
# inner aligned
pd.DataFrame({i: p.reindex(inner) for i, p in enumerate(parts)})
0 1 2
b 1 3 6

And that’s our conceptual recreation of pandas.concat just using our knowledge of how the index works!

Wrap-Up#

That’s all we have time for this week! What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Make sure you tune back in next week for a fresh post on Python, pandas, Matplotlib, and more. Talk to you all then.