DataFrames: one-hot encoding and back

Welcome to this week’s edition of Cameron’s Corner! This week, I want to take a quick dive into one-hot encoding and how we can use it within our popular tabular data tools in Python.

For some background, one-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format where each category is represented as a unique vector of 0s and 1s. It is commonly used when dealing with categorical variables in algorithms that require numerical input, such as neural networks and decision trees. Each category is transformed into a binary vector with one element set to 1 and all others set to 0, allowing models to interpret the data without assigning implicit ordinal relationships. One-hot encoding is especially useful in cases like text classification, image recognition, and natural language processing tasks.

Mechanically, one-hot encoding takes in a dense array of labels and converts it into a sparse representation which has as many columns as there are unique values in the original dataset.

In this blog post, we will also see how to perform the inverse transformation, that is, moving from a sparse array back to a dense representation of the label data.
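To make the two directions concrete before reaching for any library, here is a minimal pure-Python sketch. The variable names are mine and purely illustrative, and sorting the unique labels is just one assumed way to fix the column order.

```python
labels = ['A', 'B', 'C', 'D', 'A']

# fix a column order by sorting the unique labels (an assumption for this sketch)
categories = sorted(set(labels))

# dense → sparse: one row per label, one column per category
one_hot = [[int(label == cat) for cat in categories] for label in labels]

# sparse → dense: locate the single 1 in each row and look up its category
decoded = [categories[row.index(1)] for row in one_hot]

print(one_hot)
print(decoded)
```

Each row holds exactly one 1, which is what lets the inverse step recover the original label without ambiguity.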

pandas

dense → sparse

Let’s start by making some label data, with some known unique categories.

from pandas import Series, DataFrame

pd_dense = Series([*'ABCDA'], dtype='category', name='label')
pd_dense.to_frame()
label
0 A
1 B
2 C
3 D
4 A

To one-hot encode this data, we can iterate over the unique categories from the data and simply create a new column for each label. In each of these new columns, we observe a binary output where 1s are an affirmation that the given label appeared in this position and 0s indicate the opposite.

Take a look at the following output and see if you can logically map where each label 'A' appeared in the previous pandas.DataFrame.

(
    DataFrame({
        label: pd_dense == label for label in pd_dense.cat.categories
    })
    .astype(int)
)
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0

While doing transformations in a mechanical manner (as we did above) is a fun learning exercise, we can also use the very handy pandas.get_dummies function to arrive at the same result.

from pandas import get_dummies

pd_sparse = get_dummies(pd_dense).astype(int)
pd_sparse
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
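As a small variation, get_dummies also accepts a dtype argument, so the trailing astype call can be folded into the call itself. The sketch below rebuilds the label data under a new name (labels, my choice, so it does not clash with the variables above) and checks that the manual and built-in approaches produce the same values.

```python
from pandas import DataFrame, Series, get_dummies

labels = Series([*'ABCDA'], dtype='category', name='label')

# manual approach: one boolean mask per category, converted to integers
manual = DataFrame(
    {cat: labels == cat for cat in labels.cat.categories}
).astype(int)

# built-in approach: ask get_dummies for integers directly
built_in = get_dummies(labels, dtype=int)

# compare raw values only; the column index types may differ
print((manual.to_numpy() == built_in.to_numpy()).all())
```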

sparse → dense

But how do we go back? Well, we can use the very convenient pandas.from_dummies function.

from pandas import from_dummies

pd_dense = from_dummies(pd_sparse).set_axis(['label'], axis=1)
pd_dense
label
0 A
1 B
2 C
3 D
4 A
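When the dummy columns carry a prefix (as they do when get_dummies is applied to a whole DataFrame), from_dummies can strip that prefix through its sep argument and use it as the name of the restored column. This is a small sketch under that assumption; the frame, encoded, and decoded names are mine.

```python
from pandas import DataFrame, from_dummies, get_dummies

frame = DataFrame({'label': [*'ABCDA']})

# encoding a DataFrame column prefixes each dummy with the column name
encoded = get_dummies(frame, columns=['label'], dtype=int)
print(encoded.columns.tolist())

# sep tells from_dummies where the prefix ends and the category begins
decoded = from_dummies(encoded, sep='_')
print(decoded)
```

With the prefix in place, there is no need to rename the resulting column by hand.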

Since we can reliably roundtrip between these two data representations in pandas, let’s see if its blazing-fast counterpart, Polars, is up to the same challenge.

Polars

First, let’s carry over our DataFrame from pandas and convert our label column to a polars.Enum datatype, using the category metadata carried by the pandas categorical Series.

from polars import from_pandas, Enum

pl_dense = (
    from_pandas(pd_dense)
    .cast({'label': Enum(pd_dense['label'].cat.categories)})
)

pl_dense
shape: (5, 1)
label
enum
"A"
"B"
"C"
"D"
"A"

dense → sparse

The easy way to one-hot encode our data is to use the to_dummies method, which exists on both polars.DataFrame and polars.Series. This generates the expected result, but it is unfortunately only available in eager mode and not on the polars.LazyFrame.

pl_dense['label'].to_dummies()
shape: (5, 4)
label_A label_B label_C label_D
u8 u8 u8 u8
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0

So, how do we one-hot encode a polars.LazyFrame? We take an approach similar to our first pandas example: iterate over the unique values and generate a boolean mask for each one. Because the column is an Enum, to_physical exposes each label's integer code, which matches the label's position in the list of categories.

from polars import col, UInt8

pl_dense_lazy = pl_dense.lazy()

# extract categories from the Enum dtype
categories = pl_dense_lazy.collect_schema()['label'].categories

pl_sparse = pl_dense_lazy.select(
    (col('label').to_physical() == i).alias(label).cast(UInt8)
    for i, label in enumerate(categories)
).collect()

pl_sparse
shape: (5, 4)
A B C D
u8 u8 u8 u8
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0

sparse → dense

Unlike pandas, Polars has no convenient from_dummies function. This means that we’ll need to implement this on our own! Thankfully, it is quite straightforward given the fantastic expressions API that Polars has.

from polars import coalesce, when as pl_when

pl_sparse.select(
    coalesce(
        pl_when(col(name) == 1).then(i)
        for i, name in enumerate(pl_sparse.columns)
    ).cast(Enum(pl_sparse.columns))
)
shape: (5, 1)
literal
enum
"A"
"B"
"C"
"D"
"A"

Making sense of the code above, we build one expression per column: wherever that column holds a 1, the expression yields the column's positional index, and everywhere else it yields null. The result of this intermediate operation is shown below:

pl_intermediate = (
    pl_sparse.select(
        pl_when(col(name) == 1).then(i).alias(name)
        for i, name in enumerate(pl_sparse.columns)
    )
)

pl_intermediate
shape: (5, 4)
A B C D
i32 i32 i32 i32
0 null null null
null 1 null null
null null 2 null
null null null 3
0 null null null

We can then coalesce these results into a dense representation and cast it to our Enum datatype to preserve the label values.

pl_sparse.select(
    coalesce(pl_intermediate).alias('intermediate'),
    
    coalesce(pl_intermediate)
    .cast(Enum(pl_sparse.columns))
    .alias('result'),
)
shape: (5, 2)
intermediate result
i32 enum
0 "A"
1 "B"
2 "C"
3 "D"
0 "A"

Wrap-Up

And now you can confidently one-hot encode your data in both pandas and Polars! Stay tuned for next week when I extend these transformations to DuckDB.

What do you think about my approach? Do you use one-hot encoding? Let me know on the DUTC Discord server.

Talk to you all next week!