# DataFrame Joins & MultiSets

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Last week, I covered unique equality joins, which describe the simplest scenario: the one in which set logic and table-join logic completely overlap. This parallels the common idea that table joins can be represented with Venn diagrams. This week, I want to show where this mode of thinking tends to fall flat.

## Non-Unique Equality Joins

Just like last week, we are still discussing equality joins. That is, to join tables, we need to see strict equality for both sides in the join column. Now, we’ll introduce having non-unique values in our join columns to cover one-to-many and many-to-many joins.

To refresh us, last week we saw unique equality joins:

```
from pandas import DataFrame

df_left = DataFrame({
    'group': ['a', 'b', 'c', 'd'],
    'value': [ 1 ,  2 ,  3 ,  4 ],
})
df_right = DataFrame({
    'group': ['c', 'd', 'e', 'f'],
    'value': [ -3,  -4,  -5,  -6],
})

s_left, s_right = set(df_left['group']), set(df_right['group'])

display_grid(
    1, 2,
    left_df=df_left.pipe(light_bg_table, '#FFDAB9'),
    right_df=df_right.pipe(light_bg_table, '#ADD8E6'),
    outer_join=(
        df_left
        .merge(df_right, on='group', how='outer', indicator=True)
        .pipe(color_parts)  # color the output based on the _merge column
    ),
    set_union=s_left | s_right,
)
```

Left df

| group | value |
|---|---|
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |

Right df

| group | value |
|---|---|
| c | -3 |
| d | -4 |
| e | -5 |
| f | -6 |

Outer join

|   | group | value_x | value_y | _merge |
|---|---|---|---|---|
| 0 | a | 1 | nan | left_only |
| 1 | b | 2 | nan | left_only |
| 2 | c | 3 | -3 | both |
| 3 | d | 4 | -4 | both |
| 4 | e | nan | -5 | right_only |
| 5 | f | nan | -6 | right_only |

Set union

`{'a', 'b', 'c', 'd', 'e', 'f'}`

We saw that the set union enumerated all of the values in the merged `group` column of the output DataFrame. However, once we introduce duplicate join keys, this set-based conceptual understanding tends to fall short:

```
from pandas import DataFrame

# notice that the group column has duplicate (non-unique!) values
df_left = DataFrame({
    'group': ['a', 'b', 'b', 'c', 'c', 'c'],
    'value': [ 1 ,  2 ,  3 ,  4 ,  5 ,  6 ],
})
df_right = DataFrame({
    'group': ['b', 'c', 'c', 'd'],
    'value': [ -3,  -4,  -5,  -6],
})

s_left, s_right = set(df_left['group']), set(df_right['group'])

display_grid(
    1, 2,
    left_df=df_left.pipe(light_bg_table, '#FFDAB9'),
    right_df=df_right.pipe(light_bg_table, '#ADD8E6'),
    left_set=s_left,
    right_set=s_right,
    outer_join=(
        df_left
        .merge(df_right, on='group', how='outer', indicator=True)
        .pipe(color_parts)  # color the output based on the _merge column
    ),
    set_union=s_left | s_right,
)
```

Left df

| group | value |
|---|---|
| a | 1 |
| b | 2 |
| b | 3 |
| c | 4 |
| c | 5 |
| c | 6 |

Right df

| group | value |
|---|---|
| b | -3 |
| c | -4 |
| c | -5 |
| d | -6 |

Left set

`{'a', 'b', 'c'}`

Right set

`{'b', 'c', 'd'}`

Outer join

|   | group | value_x | value_y | _merge |
|---|---|---|---|---|
| 0 | a | 1 | nan | left_only |
| 1 | b | 2 | -3 | both |
| 2 | b | 3 | -3 | both |
| 3 | c | 4 | -4 | both |
| 4 | c | 4 | -5 | both |
| 5 | c | 5 | -4 | both |
| 6 | c | 5 | -5 | both |
| 7 | c | 6 | -4 | both |
| 8 | c | 6 | -5 | both |
| 9 | d | nan | -6 | right_only |

Set union

`{'a', 'b', 'c', 'd'}`

The union of our two sets is correct in that it enumerates all of the values in the output, but it fails to capture **how many** of each of those values we expect to observe. Thankfully, we have another data structure that we can rely on to model this: a multiset.

## Multisets

A multiset captures the unique members of a group as well as their multiplicity (how many times each member is observed). Multisets exhibit all of the same features that a simple set does, meaning we can still ask questions about set membership, unions, intersections, etc.

A new feature is the ability to perform counting math with multisets. That is, we can mathematically add and subtract the multiplicities of one multiset from another.

In Python, we can use a `collections.Counter` to represent a multiset.
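To make this concrete, here is a small illustrative sketch (not from the original post) that builds multisets directly from the join columns of the non-unique DataFrames above; the frames are redefined so the snippet stands alone:

```
from collections import Counter
from pandas import DataFrame

df_left = DataFrame({
    'group': ['a', 'b', 'b', 'c', 'c', 'c'],
    'value': [ 1 ,  2 ,  3 ,  4 ,  5 ,  6 ],
})
df_right = DataFrame({
    'group': ['b', 'c', 'c', 'd'],
    'value': [ -3,  -4,  -5,  -6],
})

# a Counter models each join column as a multiset:
# the unique keys along with their multiplicities
m_left = Counter(df_left['group'])
m_right = Counter(df_right['group'])

print(m_left)   # Counter({'c': 3, 'b': 2, 'a': 1})
print(m_right)  # Counter({'c': 2, 'b': 1, 'd': 1})
```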

### Counter Approximation

```
from collections import Counter

left = Counter({'a': 1, 'b': 1})
right = Counter({'a': 1, 'b': 2, 'c': 3})

print(
    f'{left = }',
    f'{right = }',
    '',
    f'{left & right = }',   # intersection → minimum multiplicity for shared keys
    f'{left | right = }',   # union → maximum multiplicity for shared keys
    f'{left <= right = }',  # left is a subset of right
    '',
    f'{left + right = }',   # multisets can be summed
    f'{right - left = }',   # 0 multiplicity removes the member from the multiset
    sep='\n',
)
```

```
left = Counter({'a': 1, 'b': 1})
right = Counter({'c': 3, 'b': 2, 'a': 1})

left & right = Counter({'a': 1, 'b': 1})
left | right = Counter({'c': 3, 'b': 2, 'a': 1})
left <= right = True

left + right = Counter({'b': 3, 'c': 3, 'a': 2})
right - left = Counter({'c': 3, 'b': 1})
```

As far as approximations of a multiset go, the `collections.Counter` is very good, but we are missing one mathematical operation to best represent our joins: multiplication of multiplicities. For this, let's reach for our good friend, the `pandas.Series`.
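Before switching to pandas, note that the missing operation can be emulated on top of `Counter` with a small helper. This is purely an illustrative sketch: the `mul_counts` name and its `fill` parameter are my own, not part of `collections`:

```
from collections import Counter

def mul_counts(a, b, fill):
    # multiply multiplicities key by key; a key missing from one side
    # contributes `fill` (1 behaves like an outer join, 0 like an inner join)
    return Counter({k: a.get(k, fill) * b.get(k, fill) for k in a.keys() | b.keys()})

m_left = Counter({'a': 1, 'b': 2, 'c': 3})
m_right = Counter({'b': 1, 'c': 2, 'd': 1})

print(mul_counts(m_left, m_right, fill=1))  # expected row counts for an outer join
print(mul_counts(m_left, m_right, fill=0))  # expected row counts for an inner join
```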

### pandas Series Approximation

A `pandas.Series` provides us with another approximation of a multiset with more flexible behaviors. We already know that the `pandas.Index` (a core piece of a `pandas.Series`) exhibits very similar functionality to that of a set.

```
from pandas import Index

idx1 = Index(['a', 'b', 'c', 'd'])
idx2 = Index(['c', 'd', 'e', 'f'])

print(
    f'{idx1.union(idx2) = }',
    f'{idx1.intersection(idx2) = }',
    f'{idx1.difference(idx2) = }',
    f'{idx2.difference(idx1) = }',
    f'{idx1.symmetric_difference(idx2) = }',
    sep='\n',
)
```

```
idx1.union(idx2) = Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
idx1.intersection(idx2) = Index(['c', 'd'], dtype='object')
idx1.difference(idx2) = Index(['a', 'b'], dtype='object')
idx2.difference(idx1) = Index(['e', 'f'], dtype='object')
idx1.symmetric_difference(idx2) = Index(['a', 'b', 'e', 'f'], dtype='object')
```

Now we can combine a meaningful `Index` with some paired values into a `Series`.

```
df_left['group'].value_counts()
```

```
group
c 3
b 2
a 1
Name: count, dtype: int64
```

This is very similar to the above `collections.Counter` in that we have a single object with two distinct substructures. The `collections.Counter` has keys, whereas the `pandas.Series` has an `Index`. The `collections.Counter` has values, whereas the `pandas.Series` has a typed array (typically backed by NumPy or a `pandas.array`).

```
s_left = df_left['group'].value_counts()
s_right = df_right['group'].value_counts()

print(
    # outer join
    s_left.mul(s_right, fill_value=1).astype(int),

    # inner join
    s_left.mul(s_right, fill_value=0).astype(int),
    sep='\n',
)
```

```
group
a 1
b 2
c 6
d 1
Name: count, dtype: int64
group
a 0
b 2
c 6
d 0
Name: count, dtype: int64
```

The above tells us that for an outer join, we would expect one 'a' and one 'd' in our result, because these keys only appear on one side of the join. Then, when we have overlapping non-unique values, we end up with the Cartesian product of their rows, whose size is exactly the product of the multiplicities of those values.

Our inner join is quite similar, except that we observe a multiplicity of 0 for keys that exist on only one side of the join.
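We can sanity-check this multiplicity arithmetic against pandas itself: the sum of the per-key products should equal the number of rows the merge actually produces. A quick verification sketch (the frames are redefined so it runs standalone):

```
from pandas import DataFrame

df_left = DataFrame({
    'group': ['a', 'b', 'b', 'c', 'c', 'c'],
    'value': [ 1 ,  2 ,  3 ,  4 ,  5 ,  6 ],
})
df_right = DataFrame({
    'group': ['b', 'c', 'c', 'd'],
    'value': [ -3,  -4,  -5,  -6],
})

s_left = df_left['group'].value_counts()
s_right = df_right['group'].value_counts()

# predicted row counts per key: the product of the multiplicities
outer_pred = s_left.mul(s_right, fill_value=1).astype(int)
inner_pred = s_left.mul(s_right, fill_value=0).astype(int)

# the predictions agree with the size of the actual joins
assert outer_pred.sum() == len(df_left.merge(df_right, on='group', how='outer'))
assert inner_pred.sum() == len(df_left.merge(df_right, on='group', how='inner'))
```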

## Wrap-Up

It turns out that Venn diagrams and simple sets are not enough to describe the behaviors of table joins. While these representations make it easy to convey the expectation/procedure of joins, we can see that a multiset provides more accurate modeling of join behavior—especially when we have duplicate join keys.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Hope you enjoyed this week’s blog post—talk to you next time!