DataFrame Joins & Sets#
There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).
Unique Equality Joins#
Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).
For these examples, the join column in both tables is unique. This means that our resultant join will always be one-to-one: one row in the table on the left aligns with one row in the table on the right.
Additionally, we’re enforcing strict equality in our join, meaning that we only want rows from either table when their join columns have values that match EXACTLY.
This type of join falls very neatly into basic set theory. For example, take the following
DataFrames and sets (representing their values on the 'group'
column):
from pandas import DataFrame
df_left = DataFrame({
'group': ['a', 'b', 'c', 'd'],
'value': [ 1 , 2 , 3 , 4 ],
})
df_right = DataFrame({
'group': ['c', 'd', 'e', 'f' ],
'value': [ -3, -4, -5, -6 ],
})
s_left, s_right = set(df_left['group']), set(df_right['group'])
display_grid(
1, 2,
left_df= df_left.pipe(light_bg_table, '#FFDAB9'),
right_df= df_right.pipe(light_bg_table, '#ADD8E6'),
left_set= s_left,
right_set=s_right,
)
Left df
group | value |
---|---|
a | 1 |
b | 2 |
c | 3 |
d | 4 |
Right df
group | value |
---|---|
c | -3 |
d | -4 |
e | -5 |
f | -6 |
Left set
Right set
Here you can see that each DataFrame is composed of two columns, which we will
join on the 'group'
column. Furthermore, you can see that we can represent these
columns via a Python set
. For each of the following operations, we
are going to make the following comparisons:
Join Type |
Set Operation |
---|---|
outer |
union |
inner |
intersection |
left |
left set |
right |
right set |
Before we get started, I’m going to define a couple of helper functions to color our left and right tables as well as include them in each of our outputs.
from functools import partial
def color_parts(df):
def _color_row(row):
bg_dict = {'left_only': '#FFDAB9', 'right_only': '#ADD8E6'}
style = ['']
if row['_merge'] in bg_dict:
style = [
f"background-color: {bg_dict[row['_merge']]}; color: black"
]
return style * len(row)
return (
df.style.apply(_color_row, axis=1)
.format('{:g}', subset=df.select_dtypes('number').columns)
)
display_grid_basis = partial(
display_grid, 1, 2,
left= df_left.pipe(light_bg_table, '#FFDAB9'),
right=df_right.pipe(light_bg_table, '#ADD8E6'),
)
Outer Join & Union#
Let’s look at the typical outer join. This type of join will return a row
for every value in the set union of the 'group'
column from either DataFrame.
Whenever we encounter a value in the 'group'
column that does not exist in either
of the tables, we populate that/those row(s) with NaN/null values.
In pandas, we can pass indicator=True
to create the column '_merge'
that informs
us where each column originated. In this case, our outer merge is able to pull
rows that exist in just the left, just the right, or both of our constituent
DataFrames.
display_grid_basis(
outer_join=df_left
.merge(df_right, on='group', how='outer', indicator=True)
.pipe(color_parts),
set_union=s_left.union(s_right)
)
Left
group | value |
---|---|
a | 1 |
b | 2 |
c | 3 |
d | 4 |
Right
group | value |
---|---|
c | -3 |
d | -4 |
e | -5 |
f | -6 |
Outer join
group | value_x | value_y | _merge | |
---|---|---|---|---|
0 | a | 1 | nan | left_only |
1 | b | 2 | nan | left_only |
2 | c | 3 | -3 | both |
3 | d | 4 | -4 | both |
4 | e | nan | -5 | right_only |
5 | f | nan | -6 | right_only |
Set union
Inner Join & Intersection#
The inner join will only return rows where the unique value in the 'group'
column exists on both sides of our join. This is why the resultant merge
indicator only has 'both'
as a value.
Likewise in our set theory, this type of join corresponds to the intersection
of two sets. Note the agreement between our resultant set and the 'group'
column from the joined table.
display_grid_basis(
inner_join=df_left
.merge(df_right, on='group', how='inner', indicator=True)
.pipe(color_parts),
set_intersection=s_left.intersection(s_right)
)
Left
group | value |
---|---|
a | 1 |
b | 2 |
c | 3 |
d | 4 |
Right
group | value |
---|---|
c | -3 |
d | -4 |
e | -5 |
f | -6 |
Inner join
group | value_x | value_y | _merge | |
---|---|---|---|---|
0 | c | 3 | -3 | both |
1 | d | 4 | -4 | both |
Set intersection
Left Join#
Unlike the outer/inner joins, the left join will result in a table whose values
in the 'group'
column exactly match the values from the original left-side
table. Relating this to set theory, this is really a no-op as we don’t consider
the set on the right-hand side at all.
display_grid_basis(
left_join=df_left
.merge(df_right, on='group', how='left', indicator=True)
.pipe(color_parts),
left_set=s_left
)
Left
group | value |
---|---|
a | 1 |
b | 2 |
c | 3 |
d | 4 |
Right
group | value |
---|---|
c | -3 |
d | -4 |
e | -5 |
f | -6 |
Left join
group | value_x | value_y | _merge | |
---|---|---|---|---|
0 | a | 1 | nan | left_only |
1 | b | 2 | nan | left_only |
2 | c | 3 | -3 | both |
3 | d | 4 | -4 | both |
Left set
Right Join#
For our equivalent right join, we do the same as the left join, except we only
use 'group'
values from the table on the right side.
display_grid_basis(
right_join=df_left
.merge(df_right, on='group', how='right', indicator=True)
.pipe(color_parts),
right_set=s_right,
)
Left
group | value |
---|---|
a | 1 |
b | 2 |
c | 3 |
d | 4 |
Right
group | value |
---|---|
c | -3 |
d | -4 |
e | -5 |
f | -6 |
Right join
group | value_x | value_y | _merge | |
---|---|---|---|---|
0 | c | 3 | -3 | both |
1 | d | 4 | -4 | both |
2 | e | nan | -5 | right_only |
3 | f | nan | -6 | right_only |
Right set
Wrap-Up#
That wraps up our discussion of unique equivalent joins. Next week, we’ll explore how we can incorporate equivalence joins for non-unique columns and discuss where basic set theory falls short of appropriately representing these concepts.
What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.
Hope you enjoyed this week’s blog post—talk to you next time!