A FlagEnum Categorical in pandas

Hi all, welcome back to Cameron's Corner! This week, I want to explore the encoding of combinatoric sets (from a limited pool) inside a pandas.DataFrame. In more colloquial terms, I want to explore the following example:

We have a catalog of programming articles & videos (entities).
Each entity is associated with descriptive tags (e.g., Python, pandas, Matplotlib).
How can we use a pandas.Categorical to encode all unique combinations of a given set of tags?

Before we get started, I want to let you know about the FINAL session of our seminar series, “(Even More) Python Basics for Experts.” Join James in covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

Back to the topic at hand!

Typically, this problem is solved in a database by creating a secondary table that exhibits a 1:many relationship between entities and tags. However, for the purposes of this explorative example, I wanted to use DataFrames to encode this information without the need for an explicit secondary data structure.

Note that the below ideas are to provide discussion and are not a recommendation that anyone encodes their data in this manner as it likely will not scale well.

Before we dive into pandas, let’s talk about how we can represent these ideas in pure Python.

Python's Flag Enum

In Python, we can use enum.Enum to represent a limited set of unique members. For our purposes, we are going to use enum.Flag, which encompasses enum.Enum’s behavior with an important feature: the ability to uniquely encode combinations of its members.

Let's first focus on the basic enum capabilities.

from enum import Enum, Flag

EnumTags = Enum('EnumTags', 'python pandas matplotlib')
FlagTags = Flag('FlagTags', 'python pandas matplotlib')

display(
    [*EnumTags],
    [*FlagTags],
)

[<EnumTags.python: 1>, <EnumTags.pandas: 2>, <EnumTags.matplotlib: 3>]

[<FlagTags.python: 1>, <FlagTags.pandas: 2>, <FlagTags.matplotlib: 4>]

Take note of the values adjacent to each of the enum.Enum members in the above output.

Notice that the standard enum.Enum values increment by one for each member: 1 → 2 → 3. Whereas, the enum.Flag values increment from 1 → 2 → 4. Where do these values come from?

Well, in any enum.Enum subclass, these values are created by Enum._generate_next_value. For enum.Flag, our members follow the pattern 2**i, where i is each value from 0 to len(members). We can take a look at the source code to verify this:

from inspect import getsource

# from this point forward we will only use enum.Flag 
Tags = Flag('Tags', 'python pandas matplotlib')


print(getsource(Tags._generate_next_value_))

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        """
        Generate the next value when not given.

        name: the name of the member
        start: the initial start value or None
        count: the number of existing members
        last_values: the last value assigned or None
        """
        if not count:
            return start if start is not None else 1
        last_value = max(last_values)
        try:
            high_bit = _high_bit(last_value)
        except Exception:
            raise TypeError('invalid flag value %r' % last_value) from None
        return 2 ** (high_bit+1)

We can even recreate these values using the above logic with NumPy quite readily:

from numpy import arange

2 ** arange(len(Tags))

array([1, 2, 4])

Integers and Bits

The reason for using the above pattern is because we can use bytearrays to represent our unique members. If I were to decompose these integers into their corresponding binary representations, we end up with a pattern that is a 1:1 match of a dummy coding of our tags.

from pandas import DataFrame

DataFrame(
    columns=['python', 'pandas', 'matplotlib'],
    data=[
        [0, 0, 1], # python      1
        [0, 1, 0], # pandas      2
        [1, 0, 0], # matplotlib  4
    ]
)

	python	pandas	matplotlib
0	0	0	1
1	0	1	0
2	1	0	0

The Flag part of enum.Flag

While it is interesting how Python generates integer values that back the enum.Flag, the more interesting part is that this pattern enables us to combine enum members.

In the above DataFrame we can easily encode a value which contains 'python', 'matplotlib'.

from pandas import DataFrame

DataFrame(
    columns=['python', 'pandas', 'matplotlib'],
    data=[
        [0, 1, 1], # pandas | matplotlib  3
        [1, 0, 1], # python | matplotlib  5
    ]
)

	python	pandas	matplotlib
0	0	1	1
1	1	0	1

In fact, this is exactly what Flag does under the hood. We can readily combine its members:

print(
    f'{Tags.python!r}',
    f'{Tags.python | Tags.pandas!r}',
    f'{Tags.python | Tags.matplotlib!r}',
    f'{Tags.matplotlib | Tags.python!r}', # same as python|matplotlib
    f'{Tags.matplotlib | Tags.pandas!r}', # same as python|matplotlib
    f'{Tags.python | Tags.pandas | Tags.matplotlib!r}',
    sep='\n'
)

<Tags.python: 1>
<Tags.python|pandas: 3>
<Tags.python|matplotlib: 5>
<Tags.python|matplotlib: 5>
<Tags.pandas|matplotlib: 6>
<Tags.python|pandas|matplotlib: 7>

As an aside, if you've used Python's re module and have had to pass multiple flags, then you've actually already used this feature.

Notice that I can combine members of our tags—for our purposes, this feature is a very convenient behavior because any entity we have can have be associated with 0 or more tags. With this pattern we can represent ANY combination of tag members as an integer while preserving the entirety of the information conveyed by that tag-set.

All Flag Options

Let's ditch the Python implementation and instead use NumPy/pandas to represent this same type of logic. We've already seen a short example above, but I'll enumerate all possibilities for a tag-set with a size of three.

from numpy import arange
from pandas import DataFrame

df = DataFrame(
    columns=['matplotlib', 'pandas', 'python'],
    data=[
        [0, 0, 0], # None              0
        [0, 0, 1], # python            1
        [0, 1, 0], # pandas            2
        [0, 1, 1], # python|pandas     3
        [1, 0, 0], # mpl               4
        [1, 0, 1], # mpl|python        5
        [1, 1, 0], # mpl|pandas        6
        [1, 1, 1], # python|pandas|mpl 7
    ]
)

# we can convert the binary representations to integers
df['encoded'] = df @ (2 ** arange(df.columns.size)[::-1])
df

	matplotlib	pandas	python	encoded
0	0	0	0	0
1	0	0	1	1
2	0	1	0	2
3	0	1	1	3
4	1	0	0	4
5	1	0	1	5
6	1	1	0	6
7	1	1	1	7

But, what if we don't want to write out all of those bit permutations by hand? Well, it turns out we can leverage a little NumPy to generate all of them for us:

from numpy import arange, zeros

def bitwise_permutations(n):
    num_permutations = 2 ** n
    indices = arange(num_permutations, dtype='uint64')
    bitwise_array = zeros((num_permutations, n), dtype='uint8')
    for i in range(n):
        bitwise_array[:, n - i - 1] = (indices >> i) & 1
    return bitwise_array

bitwise_permutations(len(Tags))

array([[0, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [1, 0, 1],
       [1, 1, 0],
       [1, 1, 1]], dtype=uint8)

Now we can construct our Tags DataFrame quite easily:

from enum import Flag
from pandas import DataFrame

Tags = Flag('Tags', 'python pandas matplotlib')

df = DataFrame(
    columns=[t.name for t in reversed(Tags)],
    data=bitwise_permutations(len(Tags)),
).astype('uint8')

df = df.assign(
    # note that the dot product below can be replaced with arange(len(bitcodes))
    #   because bitwise_permutations generates these values in order.
    #   I wanted to demonstrate the coupling between these entities.
    encoded=df @ (2 ** arange(df.columns.size)[::-1]),
    
    # create human-readable representations (1 → python, 3 → pandas|python)
    human=[
        ','.join(df.columns[row]) for row in df.to_numpy().astype(bool)
    ]
)

df

	matplotlib	pandas	python	encoded	human
0	0	0	0	0
1	0	0	1	1	python
2	0	1	0	2	pandas
3	0	1	1	3	pandas,python
4	1	0	0	4	matplotlib
5	1	0	1	5	matplotlib,python
6	1	1	0	6	matplotlib,pandas
7	1	1	1	7	matplotlib,pandas,python

Now that we have an enumeration of all possible tags and their combinations (backed by numbers 0:2**N), we can represent this information as a pandas.CategoricalDtype.

from pandas import CategoricalDtype

FlagDtype = CategoricalDtype(categories=df['human'])
FlagDtype

CategoricalDtype(categories=['', 'python', 'pandas', 'pandas,python', 'matplotlib',
                  'matplotlib,python', 'matplotlib,pandas',
                  'matplotlib,pandas,python'],
, ordered=False, categories_dtype=object)

You’re probably thinking that this is a lot of work just to create a pandas.CategoricalDtype. But, let’s take this to an applied example to see how representing tag-sets with as a pandas.CategoricalDtype can vastly improve our memory footprint and lookup speed compared to operating along equivalent object sets (tuples or strings).

Applied Example

Let's start with some string data representing various tags that are associated with an arbitrary number of records/entities.

from enum import Flag
from pandas import Series
from numpy.random import default_rng
from numpy.ma import masked_array

rng = default_rng(1)

Tags = Flag('Tags', 'python pandas matplotlib')
tag_pool = [t.name for t in Tags]

# starting from the human representation
tag_series = Series([
    ','.join(
        sorted(
            rng.choice(
                tag_pool, 
                size=rng.integers(len(Tags)),
                replace=False
            )
        )
     )
     for _ in range(100_000)
])

tag_series

0                   pandas
1            pandas,python
2            pandas,python
3        matplotlib,python
4                         
               ...        
99995           matplotlib
99996               pandas
99997           matplotlib
99998                     
99999           matplotlib
Length: 100000, dtype: object

The first thing we need to do is transform these strings into a sparse representation of the entities they contain. In pandas, we can readily do this with Series.str.get_dummies (remember when I said this pattern is a 1:1 match with multi-value dummy coding?).

bits = tag_series.str.get_dummies(sep=',').astype('uint8')
bits.head()

	matplotlib	pandas	python
0	0	1	0
1	0	1	1
2	0	1	1
3	1	0	1
4	0	0	0

Now we can readily create the combinations of all categories by using the information contained entirely with the column of the above DataFrame.

from enum import Flag
from pandas import DataFrame, Categorical, CategoricalDtype
from numpy import arange

# Create bitcode for all possible tag combinations
perm_bitcodes = bitwise_permutations(bits.columns.size)
perms_names = [','.join(bits.columns[row]) for row in perm_bitcodes.astype(bool)]
FlagDtype = CategoricalDtype(categories=perms_names)


# Create integers from the observed bits
intcodes = bits @ (2 ** arange(bits.columns.size)[::-1]) 
result = bits.assign(
    raw=tag_series,
    encoded=Categorical.from_codes(intcodes, dtype=FlagDtype),
)

display(
    result,
    
    result
    .dtypes.to_frame('dtypes')
    .assign(memory=result.memory_usage(deep=True))
)

	matplotlib	pandas	python	raw	encoded
0	0	1	0	pandas	pandas
1	0	1	1	pandas,python	pandas,python
2	0	1	1	pandas,python	pandas,python
3	1	0	1	matplotlib,python	matplotlib,python
4	0	0	0
...	...	...	...	...	...
99995	1	0	0	matplotlib	matplotlib
99996	0	1	0	pandas	pandas
99997	1	0	0	matplotlib	matplotlib
99998	0	0	0
99999	1	0	0	matplotlib	matplotlib

100000 rows × 5 columns

	dtypes	memory
matplotlib	uint8	100000
pandas	uint8	100000
python	uint8	100000
raw	object	5665503
encoded	category	100785

Since we are using a pandas.Categorical to hold this information, we would expect to see large benefits to our memory footprint as well as comparative operations. Furthermore, by using the underlying bit patterns, we can readily perform set-like comparisons (but I'll save this idea for a different blog post).

%timeit -n1 -r1 result['raw'] == 'pandas,python'
%timeit -n1 -r1 result['encoded'] == 'pandas,python'

# both result sets are equivalent
assert (
    ((result['raw'] == 'pandas,python') == (result['encoded'] == 'pandas,python'))
    .all()
)

6.72 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
517 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

To-Do

While we still have a ways to go to create an ergonomic enum.Flag Categorical datatype in pandas, we have the core pieces of logic. The next steps would involve working with intermediated values within a pandas.array, similar to how datetime arrays are full of pandas.Timestamps:

from pandas import array, to_datetime

arr = array(to_datetime(['2000-01-01', '2000-01-02']))

print(
    arr,
    f"{arr[0] = }",
    sep='\n{}\n'.format(f'\N{box drawings light horizontal}' * 40)
)

<DatetimeArray>
['2000-01-01 00:00:00', '2000-01-02 00:00:00']
Length: 2, dtype: datetime64[ns]
────────────────────────────────────────
arr[0] = Timestamp('2000-01-01 00:00:00')

By using a pandas.api.extensions.ExtensionArray alongside an intermediated value, we can set up ergonomic queries for various setlike operations such as checking exact equality or subsets/supersets.

Wrap-Up

That was a lot of pandas! While we solved the computational part of this problem, there is still a lot more to implement. Of course, we could also further decompose this into a separate table with a 1:m relationship which would have a favorable trade-off if you exceed 64 unique tags (as this would exceed the maximum size of our encoding here), but we would also see a performance trade-off before reaching 64 unique entities.

What did you think about my solution? Let me know on the DUTC Discord. Talk to you all next week!