DataFrame Joins & Sets#

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Unique Equality Joins#

Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).

For these examples, the join column in both tables is unique. This means that our resultant join will always be one-to-one: one row in the table on the left aligns with one row in the table on the right.

Additionally, we’re enforcing strict equality in our join, meaning that we only want rows from either table when their join columns have values that match EXACTLY.

This type of join falls very neatly into basic set theory. For example, take the following DataFrames and sets (representing their values on the 'group' column):

from pandas import DataFrame

df_left = DataFrame({
    'group': ['a', 'b', 'c', 'd'],
    'value': [ 1 ,  2 ,  3 ,  4 ],
})

df_right = DataFrame({
    'group': ['c', 'd', 'e', 'f' ],
    'value': [ -3,  -4,  -5,  -6 ],
})

s_left, s_right = set(df_left['group']), set(df_right['group'])

display_grid(
    1, 2,
    left_df=  df_left.pipe(light_bg_table, '#FFDAB9'),
    right_df= df_right.pipe(light_bg_table, '#ADD8E6'),
    left_set= s_left,
    right_set=s_right,
)

Left df

group	value
a	1
b	2
c	3
d	4

Right df

group	value
c	-3
d	-4
e	-5
f	-6

Left set

{'a', 'd', 'c', 'b'}

Right set

{'c', 'e', 'd', 'f'}

Here you can see that each DataFrame is composed of two columns, which we will join on the 'group' column. Furthermore, you can see that we can represent these columns via a Python set. For each of the following operations, we are going to make the following comparisons:

Join Type	Set Operation
outer	union
inner	intersection
left	left set
right	right set

Before we get started, I’m going to define a couple of helper functions to color our left and right tables as well as include them in each of our outputs.

from functools import partial

def color_parts(df):
    def _color_row(row):
        bg_dict = {'left_only': '#FFDAB9', 'right_only': '#ADD8E6'}
        style = ['']
        if row['_merge'] in bg_dict:
            style = [
                f"background-color: {bg_dict[row['_merge']]}; color: black"
            ]
        return style * len(row)
    return (
        df.style.apply(_color_row, axis=1)
        .format('{:g}', subset=df.select_dtypes('number').columns)
    )

display_grid_basis = partial(
    display_grid, 1, 2, 
    left= df_left.pipe(light_bg_table, '#FFDAB9'),
    right=df_right.pipe(light_bg_table, '#ADD8E6'),
)

Outer Join & Union#

Let’s look at the typical outer join. This type of join will return a row for every value in the set union of the 'group' column from either DataFrame. Whenever we encounter a value in the 'group' column that does not exist in either of the tables, we populate that/those row(s) with NaN/null values.

In pandas, we can pass indicator=True to create the column '_merge' that informs us where each column originated. In this case, our outer merge is able to pull rows that exist in just the left, just the right, or both of our constituent DataFrames.

display_grid_basis(
    outer_join=df_left
        .merge(df_right, on='group', how='outer', indicator=True)
        .pipe(color_parts),

    set_union=s_left.union(s_right)
)

Left

group	value
a	1
b	2
c	3
d	4

Right

group	value
c	-3
d	-4
e	-5
f	-6

Outer join

	group	value_x	value_y	_merge
0	a	1	nan	left_only
1	b	2	nan	left_only
2	c	3	-3	both
3	d	4	-4	both
4	e	nan	-5	right_only
5	f	nan	-6	right_only

Set union

{'e', 'f', 'd', 'b', 'a', 'c'}

Inner Join & Intersection#

The inner join will only return rows where the unique value in the 'group' column exists on both sides of our join. This is why the resultant merge indicator only has 'both' as a value.

Likewise in our set theory, this type of join corresponds to the intersection of two sets. Note the agreement between our resultant set and the 'group' column from the joined table.

display_grid_basis(
    inner_join=df_left
        .merge(df_right, on='group', how='inner', indicator=True)
        .pipe(color_parts),
    
    set_intersection=s_left.intersection(s_right)
)

Left

group	value
a	1
b	2
c	3
d	4

Right

group	value
c	-3
d	-4
e	-5
f	-6

Inner join

	group	value_x	value_y	_merge
0	c	3	-3	both
1	d	4	-4	both

Set intersection

{'c', 'd'}

Left Join#

Unlike the outer/inner joins, the left join will result in a table whose values in the 'group' column exactly match the values from the original left-side table. Relating this to set theory, this is really a no-op as we don’t consider the set on the right-hand side at all.

display_grid_basis(
    left_join=df_left
        .merge(df_right, on='group', how='left', indicator=True)
        .pipe(color_parts),
    left_set=s_left
)

Left

group	value
a	1
b	2
c	3
d	4

Right

group	value
c	-3
d	-4
e	-5
f	-6

Left join

	group	value_x	value_y	_merge
0	a	1	nan	left_only
1	b	2	nan	left_only
2	c	3	-3	both
3	d	4	-4	both

Left set

{'a', 'd', 'c', 'b'}

Right Join#

For our equivalent right join, we do the same as the left join, except we only use 'group' values from the table on the right side.

display_grid_basis(
    right_join=df_left
        .merge(df_right, on='group', how='right', indicator=True)
        .pipe(color_parts),
    right_set=s_right,
)

Left

group	value
a	1
b	2
c	3
d	4

Right

group	value
c	-3
d	-4
e	-5
f	-6

Right join

	group	value_x	value_y	_merge
0	c	3	-3	both
1	d	4	-4	both
2	e	nan	-5	right_only
3	f	nan	-6	right_only

Right set

{'c', 'e', 'd', 'f'}

Wrap-Up#

That wraps up our discussion of unique equivalent joins. Next week, we’ll explore how we can incorporate equivalence joins for non-unique columns and discuss where basic set theory falls short of appropriately representing these concepts.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Hope you enjoyed this week’s blog post—talk to you next time!

Parsing Unconventional Text DataFrame Joins & MultiSets

Recent Posts

Tags

DataFrame Joins & Sets#

Unique Equality Joins#

Outer Join & Union#

Inner Join & Intersection#

Left Join#

Right Join#

Wrap-Up#