Semi-Join Your DataFrames#
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to take another look at table joins.
Before we get started, do you need a refresher? Check out my other blog posts on this topic that shares examples in Python, pandas, Polars, and DuckDB!
What is a Semi Join?#
Semi joins follow the same logic as an inner join. So, if you’re already familiar with inner joins, then semi joins will be very easy to understand. Both of these joins focus on preserving the keys from the one table that are also present in another table and follow the logic below:
left = {'a', 'b', 'c' }
right = { 'b', 'c', 'd'}
print(f'{left & right = }') # set intersection → inner join logic
left & right = {'c', 'b'}
In this example, you can think of the left - right as the rows that are unique to the left DataFrame, and the left & right as the intersection—those sweet common rows you want to keep.
If the set logic is no different between one join type and the other, then why do we differentiate “inner” joins and “semi” joins? Well, it turns out that the primary difference is the columns we pull over after aligning our rows: “inner” joins combine the two tables tables into one large table aligning on specific column(s). “Semi” joins take the left table and align its rows in accordance with our inner join logic but do not combine the resultant tables.
If you read the semi join description closely, you might think, “That’s not a join at all! It feels like a filter with more steps.” If you thought this, you would be exactly right.
Let’s take a look at how we can accomplish a semi-join in our favorite tabular data analysis tools: Polars, DuckDB and pandas.
In Polars#
If the obtained rows are the same, then the semi join will only return the keys from the left DataFrame that have matches in the right DataFrame. No additional columns from the right DataFrame will be carried over. Polars makes this easy with the how='semi'
option when joining DataFrames.
from polars import DataFrame
left_df = DataFrame({'id': [*'abc'], 'values': [1,2,3]})
right_df = DataFrame({'id': [*'bcd'], 'values': [4,5,6]})
display_grid(ncols=2, left=left_df, right=right_df)
Left
id | values |
---|---|
str | i64 |
"a" | 1 |
"b" | 2 |
"c" | 3 |
Right
id | values |
---|---|
str | i64 |
"b" | 4 |
"c" | 5 |
"d" | 6 |
Let’s start with an “inner” join for a comparison. Note that when we run this standard join type, we end up with a table that has 3 columns because we are performing two steps here:
align both tables along with their “id” column
combine the columns from the aligned constituent tables
left_df.join(right_df, on='id', how='inner')
id | values | values_right |
---|---|---|
str | i64 | i64 |
"b" | 2 | 4 |
"c" | 3 | 5 |
However, our “semi” join only runs a simplified version of step one, and completely omits step two.
We add no additional columns from the right_df
and instead simply filter the
left_df
to contain the rows whose values match in the "id"
column.
left_df.join(right_df, on='id', how='semi')
id | values |
---|---|
str | i64 |
"b" | 2 |
"c" | 3 |
In DuckDB#
DuckDB takes a similar approach to Polars in that this functionality is achieved from an explicit join operation. You can run a query that looks almost like SQL you’d use for a database. But it doesn’t go crazy with the result set, only returning the rows from the left table where there are matching values in the right table—no extra columns allowed.
import duckdb
res = duckdb.sql('''
select * from left_df
semi join right_df
using (id);
''')
res
┌─────────┬────────┐
│ id │ values │
│ varchar │ int64 │
├─────────┼────────┤
│ b │ 2 │
│ c │ 3 │
└─────────┴────────┘
Here, the query shows us a semi join in action with DuckDB’s SQL syntax. It checks for common rows based on group_name, but only keeps those rows from the left table and nothing more.
In pandas#
The syntax for a semi join in pandas is quite different because it is not implemented as a join/merge operation as a keyword argument in pandas.DataFrame.merge
or pandas.DataFrame.join
. That doesn’t mean we’re out of luck, though. We just have to get a bit more hands-on to achieve the same result.
Since all of the data you are working with is already loaded into memory, there is no need for special directives or any fancy syntax. We can simply check which of the values in the left side .isin
the right side, and filter down from there!
left_df_pd = left_df.to_pandas()
right_df_pd = right_df.to_pandas()
mask = left_df_pd['id'].isin(right_df_pd['id'])
mask
0 False
1 True
2 True
Name: id, dtype: bool
Then we can use this mask to filter our original table:
left_df_pd.loc[mask]
id | values | |
---|---|---|
1 | b | 2 |
2 | c | 3 |
Wrap-Up#
And there you have it—semi joins, the minimalist hero of your DataFrame operations! They’re like the friend who RSVPs to your party but doesn’t bring a plus-one or overload you with small talk. Just the essentials: rows that match between your DataFrames without dragging in unnecessary columns from the other side.
Whether you’re using Polars, DuckDB, or good ol’ pandas, semi joins keep things clean and to the point. It’s all about confirming that your rows have a matching partner, without getting too cozy with the details.
So, next time you’re thinking about an inner join but don’t need the full baggage from the right DataFrame, consider the semi join—your data will thank you for keeping it efficient. Now go forth, join responsibly, and may your datasets always be lean and mean!
What do you think about semi joins? Do you use them in your code? Let me know on the DUTC Discord server.
Talk to you all next week!