# DataFrame Joins & Sets#

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).

## Unique Equality Joins#

Let’s start with unique equality joins as they are the prototypical
representation of a table-join operation. This is also the *only* type of
join that neatly falls into standard set theory (without expanding to multi-sets,
which we’ll discuss later).

For these examples, the join column in both tables is unique. This means that our resultant join will always be one-to-one: one row in the table on the left aligns with one row in the table on the right.

Additionally, we’re enforcing strict equality in our join, meaning that we only want rows from either table when their join columns have values that match EXACTLY.

This type of join falls very neatly into basic set theory. For example, take the following
DataFrames and sets (representing their values on the `'group'`

column):

```
from pandas import DataFrame
df_left = DataFrame({
'group': ['a', 'b', 'c', 'd'],
'value': [ 1 , 2 , 3 , 4 ],
})
df_right = DataFrame({
'group': ['c', 'd', 'e', 'f' ],
'value': [ -3, -4, -5, -6 ],
})
s_left, s_right = set(df_left['group']), set(df_right['group'])
display_grid(
1, 2,
left_df= df_left.pipe(light_bg_table, '#FFDAB9'),
right_df= df_right.pipe(light_bg_table, '#ADD8E6'),
left_set= s_left,
right_set=s_right,
)
```

Left df

group | value |
---|---|

a | 1 |

b | 2 |

c | 3 |

d | 4 |

Right df

group | value |
---|---|

c | -3 |

d | -4 |

e | -5 |

f | -6 |

Left set

Right set

Here you can see that each DataFrame is composed of two columns, which we will
join on the `'group'`

column. Furthermore, you can see that we can represent these
columns via a Python `set`

. For each of the following operations, we
are going to make the following comparisons:

Join Type |
Set Operation |
---|---|

outer |
union |

inner |
intersection |

left |
left set |

right |
right set |

Before we get started, I’m going to define a couple of helper functions to color our left and right tables as well as include them in each of our outputs.

```
from functools import partial
def color_parts(df):
def _color_row(row):
bg_dict = {'left_only': '#FFDAB9', 'right_only': '#ADD8E6'}
style = ['']
if row['_merge'] in bg_dict:
style = [
f"background-color: {bg_dict[row['_merge']]}; color: black"
]
return style * len(row)
return (
df.style.apply(_color_row, axis=1)
.format('{:g}', subset=df.select_dtypes('number').columns)
)
display_grid_basis = partial(
display_grid, 1, 2,
left= df_left.pipe(light_bg_table, '#FFDAB9'),
right=df_right.pipe(light_bg_table, '#ADD8E6'),
)
```

### Outer Join & Union#

Let’s look at the typical outer join. This type of join will return a row
for every value in the set union of the `'group'`

column from either DataFrame.
Whenever we encounter a value in the `'group'`

column that does not exist in either
of the tables, we populate that/those row(s) with NaN/null values.

In pandas, we can pass `indicator=True`

to create the column `'_merge'`

that informs
us where each column originated. In this case, our outer merge is able to pull
rows that exist in just the left, just the right, or both of our constituent
DataFrames.

```
display_grid_basis(
outer_join=df_left
.merge(df_right, on='group', how='outer', indicator=True)
.pipe(color_parts),
set_union=s_left.union(s_right)
)
```

Left

group | value |
---|---|

a | 1 |

b | 2 |

c | 3 |

d | 4 |

Right

group | value |
---|---|

c | -3 |

d | -4 |

e | -5 |

f | -6 |

Outer join

group | value_x | value_y | _merge | |
---|---|---|---|---|

0 | a | 1 | nan | left_only |

1 | b | 2 | nan | left_only |

2 | c | 3 | -3 | both |

3 | d | 4 | -4 | both |

4 | e | nan | -5 | right_only |

5 | f | nan | -6 | right_only |

Set union

### Inner Join & Intersection#

The inner join will **only** return rows where the unique value in the `'group'`

column exists on **both** sides of our join. This is why the resultant merge
indicator only has `'both'`

as a value.

Likewise in our set theory, this type of join corresponds to the `intersection`

of two sets. Note the agreement between our resultant set and the `'group'`

column from the joined table.

```
display_grid_basis(
inner_join=df_left
.merge(df_right, on='group', how='inner', indicator=True)
.pipe(color_parts),
set_intersection=s_left.intersection(s_right)
)
```

Left

group | value |
---|---|

a | 1 |

b | 2 |

c | 3 |

d | 4 |

Right

group | value |
---|---|

c | -3 |

d | -4 |

e | -5 |

f | -6 |

Inner join

group | value_x | value_y | _merge | |
---|---|---|---|---|

0 | c | 3 | -3 | both |

1 | d | 4 | -4 | both |

Set intersection

### Left Join#

Unlike the outer/inner joins, the left join will result in a table whose values
in the `'group'`

column exactly match the values from the original left-side
table. Relating this to set theory, this is really a no-op as we don’t consider
the set on the right-hand side at all.

```
display_grid_basis(
left_join=df_left
.merge(df_right, on='group', how='left', indicator=True)
.pipe(color_parts),
left_set=s_left
)
```

Left

group | value |
---|---|

a | 1 |

b | 2 |

c | 3 |

d | 4 |

Right

group | value |
---|---|

c | -3 |

d | -4 |

e | -5 |

f | -6 |

Left join

group | value_x | value_y | _merge | |
---|---|---|---|---|

0 | a | 1 | nan | left_only |

1 | b | 2 | nan | left_only |

2 | c | 3 | -3 | both |

3 | d | 4 | -4 | both |

Left set

### Right Join#

For our equivalent right join, we do the same as the left join, except we only
use `'group'`

values from the table on the right side.

```
display_grid_basis(
right_join=df_left
.merge(df_right, on='group', how='right', indicator=True)
.pipe(color_parts),
right_set=s_right,
)
```

Left

group | value |
---|---|

a | 1 |

b | 2 |

c | 3 |

d | 4 |

Right

group | value |
---|---|

c | -3 |

d | -4 |

e | -5 |

f | -6 |

Right join

group | value_x | value_y | _merge | |
---|---|---|---|---|

0 | c | 3 | -3 | both |

1 | d | 4 | -4 | both |

2 | e | nan | -5 | right_only |

3 | f | nan | -6 | right_only |

Right set

## Wrap-Up#

That wraps up our discussion of unique equivalent joins. Next week, we’ll explore how we can incorporate equivalence joins for non-unique columns and discuss where basic set theory falls short of appropriately representing these concepts.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Hope you enjoyed this week’s blog post—talk to you next time!