DataFrames: one-hot encoding and back#
Welcome to this week’s edition of Cameron’s Corner! This week, I want to take a quick dive into one-hot encoding and how we can use it within our popular tabular data tools in Python.
For some background, one-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format where each category is represented as a unique vector of 0s and 1s. It is commonly used when dealing with categorical variables in algorithms that require numerical input, such as neural networks and decision trees. Each category is transformed into a binary vector with one element set to 1 and all others set to 0, allowing models to interpret the data without assigning implicit ordinal relationships. One-hot encoding is especially useful in cases like text classification, image recognition, and natural language processing tasks.
Mechanically, one-hot encoding takes in a dense array of labels and converts it into a sparse representation that has as many columns as there are unique values in the original dataset.
In this blog post, we will also see how to make the inverse transformation: moving from a sparse representation back to a dense representation of the label data.
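Before we reach for any libraries, here is a minimal pure-Python sketch of the transformation (my own illustration, not part of the tools we cover below):

```python
labels = ['A', 'B', 'C', 'D', 'A']
categories = sorted(set(labels))  # one column per unique label

# each row holds a single 1 marking which category that label belongs to
one_hot = [[int(label == category) for category in categories] for label in labels]
print(one_hot)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```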
pandas#
dense → sparse
Let’s start by making some label data, with some known unique categories.
from pandas import Series, DataFrame
pd_dense = Series([*'ABCDA'], dtype='category', name='label')
pd_dense.to_frame()
|   | label |
|---|-------|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | A |
To one-hot encode this data, we can iterate over the unique categories in the data and simply create a new column for each label. In each of these new columns, 1s indicate that the given label appeared at that position and 0s indicate that it did not.
Take a look at the following output and see if you can logically map where each label 'A' appeared in the previous pandas.DataFrame.
(
    DataFrame({
        label: pd_dense == label for label in pd_dense.cat.categories
    })
    .astype(int)  # convert the boolean masks to 0/1 integers
)
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 |
While doing transformations in a mechanical manner (as we did above) is a fun learning exercise, we can also use the very handy pandas.get_dummies function to arrive at the same result.
from pandas import get_dummies
pd_sparse = get_dummies(pd_dense).astype(int)
pd_sparse
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 |
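As a small aside, get_dummies also accepts a dtype parameter, so the separate .astype(int) step can likely be folded in, something like:

```python
# equivalent result in one step, using get_dummies' dtype parameter
pd_sparse = get_dummies(pd_dense, dtype=int)
```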
sparse → dense
But how do we go back? Well, we can opt to use the very convenient pandas.from_dummies function.
from pandas import from_dummies

# from_dummies returns object dtype, so restore 'category' for later use
pd_dense = from_dummies(pd_sparse).set_axis(['label'], axis=1).astype('category')
pd_dense
|   | label |
|---|-------|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | A |
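If you want to convince yourself that the roundtrip is lossless, a quick sanity check (my own addition, not from the original walkthrough) might look like this:

```python
# verify that dense → sparse → dense recovers the original labels
roundtrip = from_dummies(pd_sparse).set_axis(['label'], axis=1)
assert roundtrip['label'].tolist() == [*'ABCDA']
```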
Since we can reliably roundtrip between these two data representations in pandas, let’s see if its blazing-fast counterpart, Polars, is up to the same challenge.
Polars#
First, let’s carry over our DataFrame from pandas and convert our label column to a polars.Enum datatype using some of the metadata provided by the pandas.Categorical Series.
from polars import from_pandas, Enum

pl_dense = (
    from_pandas(pd_dense)
    .cast({'label': Enum(pd_dense['label'].cat.categories)})
)
pl_dense
| label |
|-------|
| enum  |
| "A"   |
| "B"   |
| "C"   |
| "D"   |
| "A"   |
dense → sparse
The easy way to one-hot encode our data is to use the polars.DataFrame.to_dummies method (also available on a polars.Series, as below). This generates the expected result, but it is unfortunately only available on our polars.DataFrame and not the polars.LazyFrame.
pl_dense['label'].to_dummies()
| label_A | label_B | label_C | label_D |
|---------|---------|---------|---------|
| u8 | u8 | u8 | u8 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 |
So, how do we one-hot encode a polars.LazyFrame then? Well, we take a similar approach to the one we used in our first pandas example: iterate over the unique values and generate a boolean mask for each of those values.
from polars import col, UInt8

pl_dense_lazy = pl_dense.lazy()

# extract the categories from the Enum dtype
categories = pl_dense_lazy.collect_schema()['label'].categories

pl_sparse = pl_dense_lazy.select(
    (col('label').to_physical() == i).alias(label).cast(UInt8)
    for i, label in enumerate(categories)
).collect()
pl_sparse
| A | B | C | D |
|---|---|---|---|
| u8 | u8 | u8 | u8 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 |
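To make sure our lazy recipe matches the built-in method, here is a small check of my own; note that to_dummies prefixes each column with the series name, so we rename before comparing:

```python
# the lazy result should agree with the eager to_dummies output
eager = pl_dense['label'].to_dummies().rename(
    {f'label_{category}': category for category in categories}
)
assert pl_sparse.equals(eager)
```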
sparse → dense
Unlike pandas, Polars has no convenient from_dummies function. This means that we’ll need to implement one on our own! Thankfully, it is quite straightforward given the fantastic expressions API that Polars has.
from polars import coalesce, when as pl_when

pl_sparse.select(
    coalesce(
        pl_when(col(name) == 1).then(i)
        for i, name in enumerate(pl_sparse.columns)
    ).cast(Enum(pl_sparse.columns))
)
| literal |
|---------|
| enum |
| "A" |
| "B" |
| "C" |
| "D" |
| "A" |
Making sense of the code above: we build a single expression per column whose values map to the positional location of that column. The result of the intermediate operation is shown below:
pl_intermediate = (
    pl_sparse.select(
        pl_when(col(name) == 1).then(i).alias(name)
        for i, name in enumerate(pl_sparse.columns)
    )
)
pl_intermediate
| A | B | C | D |
|---|---|---|---|
| i32 | i32 | i32 | i32 |
| 0 | null | null | null |
| null | 1 | null | null |
| null | null | 2 | null |
| null | null | null | 3 |
| 0 | null | null | null |
We can then coalesce these results into a dense representation and cast it to our Enum datatype to preserve the label values.
# iterating over a DataFrame yields its columns, so coalesce works directly
pl_sparse.select(
    coalesce(pl_intermediate).alias('intermediate'),
    coalesce(pl_intermediate)
        .cast(Enum(pl_sparse.columns))
        .alias('result'),
)
| intermediate | result |
|--------------|--------|
| i32 | enum |
| 0 | "A" |
| 1 | "B" |
| 2 | "C" |
| 3 | "D" |
| 0 | "A" |
Wrap-Up#
And now you can confidently one-hot encode your data in both pandas and Polars! Stay tuned for next week when I extend these transformations to DuckDB.
What do you think about my approach? Do you use one-hot encoding? Let me know on the DUTC Discord server.
Talk to you all next week!