DataFrame Joins & Sets#

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Unique Equality Joins#

Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).

For these examples, the join column in both tables is unique. This means that our resultant join will always be one-to-one: one row in the table on the left aligns with one row in the table on the right.

Additionally, we’re enforcing strict equality in our join, meaning that we only want rows from either table when their join columns have values that match EXACTLY.

This type of join falls very neatly into basic set theory. For example, take the following DataFrames and sets (representing their values on the 'group' column):

from pandas import DataFrame

df_left = DataFrame({
    'group': ['a', 'b', 'c', 'd'],
    'value': [ 1 ,  2 ,  3 ,  4 ],
})

df_right = DataFrame({
    'group': ['c', 'd', 'e', 'f' ],
    'value': [ -3,  -4,  -5,  -6 ],
})

s_left, s_right = set(df_left['group']), set(df_right['group'])

display_grid(
    1, 2,
    left_df=  df_left.pipe(light_bg_table, '#FFDAB9'),
    right_df= df_right.pipe(light_bg_table, '#ADD8E6'),
    left_set= s_left,
    right_set=s_right,
)

Left df

group value
a 1
b 2
c 3
d 4

Right df

group value
c -3
d -4
e -5
f -6

Left set

{'b', 'd', 'a', 'c'}

Right set

{'d', 'e', 'f', 'c'}

Here you can see that each DataFrame is composed of two columns, which we will join on the 'group' column. Furthermore, you can see that we can represent these columns via a Python set. For each of the following operations, we are going to make the following comparisons:

Join Type

Set Operation

outer

union

inner

intersection

left

left set

right

right set

Before we get started, I’m going to define a couple of helper functions to color our left and right tables as well as include them in each of our outputs.

from functools import partial

def color_parts(df):
    def _color_row(row):
        bg_dict = {'left_only': '#FFDAB9', 'right_only': '#ADD8E6'}
        style = ['']
        if row['_merge'] in bg_dict:
            style = [
                f"background-color: {bg_dict[row['_merge']]}; color: black"
            ]
        return style * len(row)
    return (
        df.style.apply(_color_row, axis=1)
        .format('{:g}', subset=df.select_dtypes('number').columns)
    )

display_grid_basis = partial(
    display_grid, 1, 2, 
    left= df_left.pipe(light_bg_table, '#FFDAB9'),
    right=df_right.pipe(light_bg_table, '#ADD8E6'),
)

Outer Join & Union#

Let’s look at the typical outer join. This type of join will return a row for every value in the set union of the 'group' column from either DataFrame. Whenever we encounter a value in the 'group' column that does not exist in either of the tables, we populate that/those row(s) with NaN/null values.

In pandas, we can pass indicator=True to create the column '_merge' that informs us where each column originated. In this case, our outer merge is able to pull rows that exist in just the left, just the right, or both of our constituent DataFrames.

display_grid_basis(
    outer_join=df_left
        .merge(df_right, on='group', how='outer', indicator=True)
        .pipe(color_parts),

    set_union=s_left.union(s_right)
)

Left

group value
a 1
b 2
c 3
d 4

Right

group value
c -3
d -4
e -5
f -6

Outer join

  group value_x value_y _merge
0 a 1 nan left_only
1 b 2 nan left_only
2 c 3 -3 both
3 d 4 -4 both
4 e nan -5 right_only
5 f nan -6 right_only

Set union

{'d', 'a', 'f', 'c', 'e', 'b'}

Inner Join & Intersection#

The inner join will only return rows where the unique value in the 'group' column exists on both sides of our join. This is why the resultant merge indicator only has 'both' as a value.

Likewise in our set theory, this type of join corresponds to the intersection of two sets. Note the agreement between our resultant set and the 'group' column from the joined table.

display_grid_basis(
    inner_join=df_left
        .merge(df_right, on='group', how='inner', indicator=True)
        .pipe(color_parts),
    
    set_intersection=s_left.intersection(s_right)
)

Left

group value
a 1
b 2
c 3
d 4

Right

group value
c -3
d -4
e -5
f -6

Inner join

  group value_x value_y _merge
0 c 3 -3 both
1 d 4 -4 both

Set intersection

{'d', 'c'}

Left Join#

Unlike the outer/inner joins, the left join will result in a table whose values in the 'group' column exactly match the values from the original left-side table. Relating this to set theory, this is really a no-op as we don’t consider the set on the right-hand side at all.

display_grid_basis(
    left_join=df_left
        .merge(df_right, on='group', how='left', indicator=True)
        .pipe(color_parts),
    left_set=s_left
)

Left

group value
a 1
b 2
c 3
d 4

Right

group value
c -3
d -4
e -5
f -6

Left join

  group value_x value_y _merge
0 a 1 nan left_only
1 b 2 nan left_only
2 c 3 -3 both
3 d 4 -4 both

Left set

{'b', 'd', 'a', 'c'}

Right Join#

For our equivalent right join, we do the same as the left join, except we only use 'group' values from the table on the right side.

display_grid_basis(
    right_join=df_left
        .merge(df_right, on='group', how='right', indicator=True)
        .pipe(color_parts),
    right_set=s_right,
)

Left

group value
a 1
b 2
c 3
d 4

Right

group value
c -3
d -4
e -5
f -6

Right join

  group value_x value_y _merge
0 c 3 -3 both
1 d 4 -4 both
2 e nan -5 right_only
3 f nan -6 right_only

Right set

{'d', 'e', 'f', 'c'}

Wrap-Up#

That wraps up our discussion of unique equivalent joins. Next week, we’ll explore how we can incorporate equivalence joins for non-unique columns and discuss where basic set theory falls short of appropriately representing these concepts.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Hope you enjoyed this week’s blog post—talk to you next time!