DataFrame Joins & MultiSets

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don't translate neatly to set logic. In this post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Last week, I covered unique equality joins which describes the simplest scenario in which sets and table join logic completely overlap. This parallels the idea that table joins can be represented with Venn diagrams. This week, I want to show where this mode of thinking tends to fall flat.

Non-Unique Equality Joins

Just like last week, we are still discussing equality joins. That is, to join tables, we need to see strict equality for both sides in the join column. Now, we'll introduce having non-unique values in our join columns to cover one-to-many and many-to-many joins.

To refresh us, last week we saw unique equality joins:

from pandas import DataFrame

df_left = DataFrame({
    'group': ['a', 'b', 'c', 'd'],
    'value': [ 1 ,  2 ,  3 ,  4 ],
})

df_right = DataFrame({
    'group': ['c', 'd', 'e', 'f' ],
    'value': [ -3,  -4,  -5,  -6 ],
})

s_left, s_right = set(df_left['group']), set(df_right['group'])

display_grid(
    1, 2,
    left_df=  df_left.pipe(light_bg_table, '#FFDAB9'),
    right_df= df_right.pipe(light_bg_table, '#ADD8E6'),

    outer_join=(
        df_left
        .merge(df_right, on='group', how='outer', indicator=True)
        .pipe(color_parts) # color the output based on the _merge column
    ),
    set_union= s_left | s_right,
)

Left df

group	value
a	1
b	2
c	3
d	4

Right df

group	value
c	-3
d	-4
e	-5
f	-6

Outer join

	group	value_x	value_y	_merge
0	a	1	nan	left_only
1	b	2	nan	left_only
2	c	3	-3	both
3	d	4	-4	both
4	e	nan	-5	right_only
5	f	nan	-6	right_only

Set union

{'b', 'c', 'a', 'd', 'f', 'e'}

We saw that the set enumerated all values in the output of our merged column in the DataFrame. However, the conceptual understanding using sets tends to fall short:

from pandas import DataFrame

# notice that the group column has duplicate (non-unique!) values
df_left = DataFrame({
    'group': ['a', 'b', 'b', 'c', 'c', 'c'],
    'value': [ 1 ,  2 ,  3 ,  4 ,  5,   6 ],
})

df_right = DataFrame({
    'group': ['b', 'c', 'c',  'd'],
    'value': [ -3,  -4,  -5,  -6 ],
})

s_left, s_right = set(df_left['group']), set(df_right['group'])

display_grid(
    1, 2,
    left_df=  df_left.pipe(light_bg_table, '#FFDAB9'),
    right_df= df_right.pipe(light_bg_table, '#ADD8E6'),
    left_set= s_left,
    right_set=s_right,
    
    outer_join=(
        df_left
        .merge(df_right, on='group', how='outer', indicator=True)
        .pipe(color_parts) # color the output based on the _merge column
    ),
    set_union= s_left | s_right,
)

Left df

group	value
a	1
b	2
b	3
c	4
c	5
c	6

Right df

group	value
b	-3
c	-4
c	-5
d	-6

Left set

{'b', 'c', 'a'}

Right set

{'b', 'd', 'c'}

Outer join

	group	value_x	value_y	_merge
0	a	1	nan	left_only
1	b	2	-3	both
2	b	3	-3	both
3	c	4	-4	both
4	c	4	-5	both
5	c	5	-4	both
6	c	5	-5	both
7	c	6	-4	both
8	c	6	-5	both
9	d	nan	-6	right_only

Set union

{'a', 'b', 'd', 'c'}

The union of our two sets are correct in that they enumerate all of the values in the output, but they fail to capture how many of each of those values we expect to observe. Thankfully, we have another data structure that we can rely on to model this: a multiset.

Multisets

A multiset is used to capture unique members of a group as well as their multiplicity (how many of each member is observed). Multisets exhibit all of the same features that a simple set does, meaning we can ask questions about set membership, unions, intersections, etc.

A new feature is an ability to perform counting math with multisets. That is, we can mathematically add and subtract the multiplicity of one multiset from another.

In Python, we can use a collections.Counter to represent a multiset.

Counter Approximation

from collections import Counter

left  = Counter({'a': 1, 'b': 1        })
right = Counter({'a': 1, 'b': 2, 'c': 3})

print(
    f'{left          = }',
    f'{right         = }',
    '',
    
    f'{left & right  = }', # intersection → minimum multiplicity for shared keys
    f'{left | right  = }', # union → maximum multiplicity for shared keys
    f'{left <= right = }', # left is a subset of right
    '',
        
    f'{left + right  = }', # multisets can be summed
    f'{right - left  = }', # 0 multiplicity removes the member from the multiset
    
    sep='\n',
)

left          = Counter({'a': 1, 'b': 1})
right         = Counter({'c': 3, 'b': 2, 'a': 1})

left & right  = Counter({'a': 1, 'b': 1})
left | right  = Counter({'c': 3, 'b': 2, 'a': 1})
left <= right = True

left + right  = Counter({'b': 3, 'c': 3, 'a': 2})
right - left  = Counter({'c': 3, 'b': 1})

As far as an approximation of a multiset goes, the collections.Counter is very good, but we are missing one mathematical operation to best represent our joins (multiplication of multiplicity). For this, let's reach for our good friend, the pandas.Series.

pandas Series Approximation

A pandas.Series provides us with another approximation of a multiset with more flexible behaviors. We already know that the pandas.Index (a core piece of a pandas.Series) exhibits very similar functionality to that of a set .

from pandas import Index

idx1 = Index(['a', 'b', 'c', 'd'          ])
idx2 = Index([          'c', 'd', 'e', 'f'])

print(
    f'{idx1.union(idx2)               = }',
    f'{idx1.intersection(idx2)        = }',
    f'{idx1.difference(idx2)           = }',
    f'{idx2.difference(idx1)           = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)

idx1.union(idx2)               = Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
idx1.intersection(idx2)        = Index(['c', 'd'], dtype='object')
idx1.difference(idx2)           = Index(['a', 'b'], dtype='object')
idx2.difference(idx1)           = Index(['e', 'f'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'b', 'e', 'f'], dtype='object')

Now we can combine a meaningful Index with some paired values into a Series.

df_left['group'].value_counts()

group
c    3
b    2
a    1
Name: count, dtype: int64

This is very similar to the above collections.Counter in that we have a single object with two distinct substructures. The collections.Counter has keys, whereas the pandas.Series has an Index. The collections.Counter has values, whereas the pandas.Series has a typed array (typically backed by NumPy or a pandas.array).

s_left = df_left['group'].value_counts()
s_right = df_right['group'].value_counts()

print(
    # outer join
    s_left.mul(s_right, fill_value=1).astype(int),
    
    # inner join
    s_left.mul(s_right, fill_value=0).astype(int),
    sep='\n'
)

group
a    1
b    2
c    6
d    1
Name: count, dtype: int64
group
a    0
b    2
c    6
d    0
Name: count, dtype: int64

The above tells us that for an outer join, we would expect one 'a' and one 'd' in our result because these rows only appear on one side of the join. Then, when we have overlapping non-unique values, we end up with the Cartesian product of their rows, which we estimate via the product of the multiplicity of those values.

Our inner join is quite similar, except that we observe a multiplicity of 0 for the parts of the table that only exist on either side of the join.

Wrap-Up

It turns out that Venn diagrams and simple sets are not enough to describe the behaviors of table joins. While these representations make it easy to convey the expectation/procedure of joins, we can see that a multiset provides more accurate modeling of join behavior—especially when we have duplicate join keys.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.