# pandas.concat, explained.#

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tackle
a pandas question I received concerning the different ways to combine `pandas.DataFrames`

.
Today, I’ll focus on `pandas.concat`

, since we have
covered DataFrame merges quite thoroughly in previous weeks. Specifically, we’ll take a look at
DataFrame Inequality Joins, DataFrame Joins & Sets and DataFrame Joins & MultiSets.

In pandas, we have three explicit ways to combine DataFrames:

`pandas.merge`

joins two DataFrames horizontally on a specific column (or optionally, the index).`pandas.join`

joins two DataFrames horizontally on their indexes.`pandas.concat`

joins many DataFrames either horizontally or vertically according to their row/column index.

The thing that sets `pandas.concat`

apart from its competition is the ability
to combine two **or more** DataFrames, whereas `pandas.merge`

—and by extension `pandas.join`

(since it is just a wrapper around `pandas.merge`

)—can only handle aligning two DataFrames.

Let’s take a look at the flexibility we have when working with this function.

## The Data#

Let’s create three DataFrames that share **some** row-index and column-index values.
Do be careful to note where these row/column-index values are equivalent. I have spaced out the code so that you can easily see
where these entities will align.

```
from numpy import arange
import pandas as pd
data = arange(0, 18).reshape(3, 3, 2)
# take note where these indexes and columns overlap and where they do not
dfs = {
'df1': pd.DataFrame(data[0], index=['a', 'b', 'c' ], columns=['x', 'y' ]),
'df2': pd.DataFrame(data[1], index=[ 'b', 'c', 'd' ], columns=['x', 'z']),
'df3': pd.DataFrame(data[2], index=[ 'b', 'd', 'e'], columns=['x', 'z']),
}
for k, df in dfs.items():
display(k, df)
```

```
'df1'
```

x | y | |
---|---|---|

a | 0 | 1 |

b | 2 | 3 |

c | 4 | 5 |

```
'df2'
```

x | z | |
---|---|---|

b | 6 | 7 |

c | 8 | 9 |

d | 10 | 11 |

```
'df3'
```

x | z | |
---|---|---|

b | 12 | 13 |

d | 14 | 15 |

e | 16 | 17 |

Given the above three frames, let’s explore the major ways we can combine them using
`pandas.concat`

.

## Concat Options#

When combining multiple DataFrames, we have two decisions to make:

Should we combine horizontally or vertically?

Should we take the intersection (inner) or the union (outer) of the index along that direction?

These two questions are controlled by the `axis=`

and `join=`

arguments in `pandas.concat`

.
Specifying `axis='rows'`

means that we will stack the DataFrames vertically. Since
we are vertically stacking our DataFrames, we will assume that we may only want
to pull a subset of the columns. If we want our result to contain ALL of the columns
across all of the inputted DataFrames, we can specify `join='outer'`

. On the other hand,
if we only want columns that are shared across all of the DataFrames (i.e., the ones
in common), then we would use `join='inner'`

.

Remember that `axis=`

indicates the direction we are stacking the DataFrames, and
`join=`

controls what rows/columns we consider while performing this stacking operation.
When `axis='rows'`

, the `join=`

exerts control on the resultant columns. Whereas
if `axis='columns'`

(stack horizontally), then `join=`

exerts control on the
resultant rows.

Let’s take a look at this in action:

```
# align vertically, preserving fully overlapping columns indices
pd.concat(dfs, axis='rows', join='outer')
```

x | y | z | ||
---|---|---|---|---|

df1 | a | 0 | 1.0 | NaN |

b | 2 | 3.0 | NaN | |

c | 4 | 5.0 | NaN | |

df2 | b | 6 | NaN | 7.0 |

c | 8 | NaN | 9.0 | |

d | 10 | NaN | 11.0 | |

df3 | b | 12 | NaN | 13.0 |

d | 14 | NaN | 15.0 | |

e | 16 | NaN | 17.0 |

```
# align vertically, preserving ONLY shared columns indices
pd.concat(dfs, axis='rows', join='inner')
```

x | ||
---|---|---|

df1 | a | 0 |

b | 2 | |

c | 4 | |

df2 | b | 6 |

c | 8 | |

d | 10 | |

df3 | b | 12 |

d | 14 | |

e | 16 |

```
# align horizontally, preserving all row indices
pd.concat(dfs, axis='columns', join='outer')
```

df1 | df2 | df3 | ||||
---|---|---|---|---|---|---|

x | y | x | z | x | z | |

a | 0.0 | 1.0 | NaN | NaN | NaN | NaN |

b | 2.0 | 3.0 | 6.0 | 7.0 | 12.0 | 13.0 |

c | 4.0 | 5.0 | 8.0 | 9.0 | NaN | NaN |

d | NaN | NaN | 10.0 | 11.0 | 14.0 | 15.0 |

e | NaN | NaN | NaN | NaN | 16.0 | 17.0 |

```
# align horizontally, preserving ONLY shared row indices
pd.concat(dfs, axis='columns', join='inner')
```

df1 | df2 | df3 | ||||
---|---|---|---|---|---|---|

x | y | x | z | x | z | |

b | 2 | 3 | 6 | 7 | 12 | 13 |

It’s a pretty flexible tool, right? Now that we know what types of problems
`pandas.concat`

solves for us, let’s talk about how this operation is carried
out.

## The Mechanics#

pandas operations are index-aware, meaning that most operations will inspect the index that has been set (on either the rows, columns, or both) and operate with the knowledge that is contained within the index.

Concatenation is not an exception to this rule. In fact, we can recreate the behavior of `pandas.concat`

through the `DataFrame.reindex`

method.
All we need to do is create the shared index (e.g., the inner [intersection] or outer [union]
of all indexes in consideration) and align our frames to that!

Let’s create some `pandas.Series`

objects to concatenate, again highlighting where
their `.index`

line up with one another.

```
from functools import reduce
import pandas as pd
parts = [
# highlighting the alignments
# a b c d e
pd.Series([0, 1, 2], index=['a', 'b', 'c' ], dtype='Int64'),
pd.Series([3, 4, 5], index=[ 'b', 'c', 'd' ], dtype='Int64'),
pd.Series([6, 7, 8], index=[ 'b', 'd', 'e'], dtype='Int64'),
]
indices = [s.index for s in parts]
```

If we take the union/intersection of each of these index objects,
noting that the `outer`

variable contains an index whose values are the union of
ALL values, and the `inner`

variable only contains values that are the intersection
of the index values…

```
display(
(outer := reduce(lambda l, r: l.union(r), indices)), # outer
(inner := reduce(lambda l, r: l.intersection(r), indices)), # inner
)
```

```
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
```

```
Index(['b'], dtype='object')
```

…then we can use these new index objects to `.reindex`

on each of our Series objects
to combine them into a single output!

```
# outer aligned
pd.DataFrame({i: p.reindex(outer) for i, p in enumerate(parts)})
```

0 | 1 | 2 | |
---|---|---|---|

a | 0 | <NA> | <NA> |

b | 1 | 3 | 6 |

c | 2 | 4 | <NA> |

d | <NA> | 5 | 7 |

e | <NA> | <NA> | 8 |

```
# inner aligned
pd.DataFrame({i: p.reindex(inner) for i, p in enumerate(parts)})
```

0 | 1 | 2 | |
---|---|---|---|

b | 1 | 3 | 6 |

And that’s our conceptual recreation of `pandas.concat`

just using our knowledge
of how the index works!

## Wrap-Up#

That’s all we have time for this week! What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Make sure you tune back in next week for a fresh post on Python, pandas, Matplotlib, and more. Talk to you all then.