A FlagEnum Categorical in pandas#
Hi all, welcome back to Cameron’s Corner! This week, I want to explore the encoding of combinatoric sets (from a
limited pool) inside a pandas.DataFrame
. In more colloquial terms, I want
to explore the following example:
We have a catalog of programming articles & videos (entities).
Each entity is associated with descriptive tags (e.g., Python, pandas, Matplotlib).
How can we use a
pandas.Categorical
to encode all unique combinations of a given set of tags?
Before we get started, I want to let you know about the FINAL session of our seminar series, “(Even More) Python Basics for Experts.” Join James in covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
Back to the topic at hand!
Typically, this problem is solved in a database by creating a secondary table that exhibits
a 1:many relationship between entities and tags. However, for the purposes of this
explorative example, I wanted to use DataFrames
to encode this information without
the need for an explicit secondary data structure.
Note that the below ideas are to provide discussion and are not a recommendation that anyone encodes their data in this manner as it likely will not scale well.
Before we dive into pandas, let’s talk about how we can represent these ideas in pure Python.
Python’s Flag Enum#
In Python, we can use enum.Enum
to represent a limited set of unique members. For our purposes, we are going to use
enum.Flag
, which encompasses
enum.Enum
’s behavior with an important feature: the ability to uniquely encode
combinations of its members.
Let’s first focus on the basic enum capabilities.
from enum import Enum, Flag
EnumTags = Enum('EnumTags', 'python pandas matplotlib')
FlagTags = Flag('FlagTags', 'python pandas matplotlib')
display(
[*EnumTags],
[*FlagTags],
)
[<EnumTags.python: 1>, <EnumTags.pandas: 2>, <EnumTags.matplotlib: 3>]
[<FlagTags.python: 1>, <FlagTags.pandas: 2>, <FlagTags.matplotlib: 4>]
Take note of the values adjacent to each of the enum.Enum
members in the above output.
Notice that the standard enum.Enum
values increment by one for each member: 1 → 2 → 3
.
Whereas, the enum.Flag
values increment from 1 → 2 → 4
. Where do these values come from?
Well, in any enum.Enum
subclass, these values are created by Enum._generate_next_value
.
For enum.Flag
, our members follow the pattern 2**i
, where i
is each value from 0 to len(members)
.
We can take a look at the source code to verify this:
from inspect import getsource
# from this point forward we will only use enum.Flag
Tags = Flag('Tags', 'python pandas matplotlib')
print(getsource(Tags._generate_next_value_))
def _generate_next_value_(name, start, count, last_values):
"""
Generate the next value when not given.
name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None
"""
if not count:
return start if start is not None else 1
for last_value in reversed(last_values):
try:
high_bit = _high_bit(last_value)
break
except Exception:
raise TypeError('Invalid Flag value: %r' % last_value) from None
return 2 ** (high_bit+1)
We can even recreate these values using the above logic with NumPy quite readily:
from numpy import arange
2 ** arange(len(Tags))
array([1, 2, 4])
Integers and Bits#
The reason for using the above pattern is because we can use bytearrays
to
represent our unique members. If I were to decompose these integers into their
corresponding binary representations, we end up with a pattern that is a 1:1 match
of a dummy coding of our tags.
from pandas import DataFrame
DataFrame(
columns=['python', 'pandas', 'matplotlib'],
data=[
[0, 0, 1], # python 1
[0, 1, 0], # pandas 2
[1, 0, 0], # matplotlib 4
]
)
python | pandas | matplotlib | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
The Flag part of enum.Flag#
While it is interesting how Python generates integer values that back the enum.Flag
,
the more interesting part is that this pattern enables us to combine enum members.
In the above DataFrame we can easily encode a value which contains 'python', 'matplotlib'
.
from pandas import DataFrame
DataFrame(
columns=['python', 'pandas', 'matplotlib'],
data=[
[0, 1, 1], # pandas | matplotlib 3
[1, 0, 1], # python | matplotlib 5
]
)
python | pandas | matplotlib | |
---|---|---|---|
0 | 0 | 1 | 1 |
1 | 1 | 0 | 1 |
In fact, this is exactly what Flag
does under the hood. We can readily combine
its members:
print(
f'{Tags.python!r}',
f'{Tags.python | Tags.pandas!r}',
f'{Tags.python | Tags.matplotlib!r}',
f'{Tags.matplotlib | Tags.python!r}', # same as python|matplotlib
f'{Tags.matplotlib | Tags.pandas!r}', # same as python|matplotlib
f'{Tags.python | Tags.pandas | Tags.matplotlib!r}',
sep='\n'
)
<Tags.python: 1>
<Tags.pandas|python: 3>
<Tags.matplotlib|python: 5>
<Tags.matplotlib|python: 5>
<Tags.matplotlib|pandas: 6>
<Tags.matplotlib|pandas|python: 7>
As an aside, if you’ve used Python’s re
module and have had to pass multiple flags, then you’ve
actually already used this feature.
Notice that I can combine members of our tags—for our purposes, this feature is a very convenient behavior because any entity we have can have be associated with 0 or more tags. With this pattern we can represent ANY combination of tag members as an integer while preserving the entirety of the information conveyed by that tag-set.
All Flag Options#
Let’s ditch the Python implementation and instead use NumPy/pandas to represent this same type of logic. We’ve already seen a short example above, but I’ll enumerate all possibilities for a tag-set with a size of three.
from numpy import arange
from pandas import DataFrame
df = DataFrame(
columns=['matplotlib', 'pandas', 'python'],
data=[
[0, 0, 0], # None 0
[0, 0, 1], # python 1
[0, 1, 0], # pandas 2
[0, 1, 1], # python|pandas 3
[1, 0, 0], # mpl 4
[1, 0, 1], # mpl 5
[1, 1, 0], # python|pandas 6
[1, 1, 1], # python|pandas|mpl 7
]
)
# we can convert the binary representations to integers
df['encoded'] = df @ (2 ** arange(df.columns.size)[::-1])
df
matplotlib | pandas | python | encoded | |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 1 |
2 | 0 | 1 | 0 | 2 |
3 | 0 | 1 | 1 | 3 |
4 | 1 | 0 | 0 | 4 |
5 | 1 | 0 | 1 | 5 |
6 | 1 | 1 | 0 | 6 |
7 | 1 | 1 | 1 | 7 |
But, what if we don’t want to write out all of those bit permutations by hand? Well, it turns out we can leverage a little NumPy to generate all of them for us:
from numpy import arange, zeros
def bitwise_permutations(n):
num_permutations = 2 ** n
indices = arange(num_permutations, dtype='uint64')
bitwise_array = zeros((num_permutations, n), dtype='uint8')
for i in range(n):
bitwise_array[:, n - i - 1] = (indices >> i) & 1
return bitwise_array
bitwise_permutations(len(Tags))
array([[0, 0, 0],
[0, 0, 1],
[0, 1, 0],
[0, 1, 1],
[1, 0, 0],
[1, 0, 1],
[1, 1, 0],
[1, 1, 1]], dtype=uint8)
Now we can construct our Tags DataFrame quite easily:
from enum import Flag
from pandas import DataFrame
Tags = Flag('Tags', 'python pandas matplotlib')
df = DataFrame(
columns=[t.name for t in reversed(Tags)],
data=bitwise_permutations(len(Tags)),
).astype('uint8')
df = df.assign(
# note that the dot product below can be replaced with arange(len(bitcodes))
# because bitwise_permutations generates these values in order.
# I wanted to demonstrate the coupling between these entities.
encoded=df @ (2 ** arange(df.columns.size)[::-1]),
# create human-readable representations (1 → python, 3 → pandas|python)
human=[
','.join(df.columns[row]) for row in df.to_numpy().astype(bool)
]
)
df
matplotlib | pandas | python | encoded | human | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 1 | 1 | python |
2 | 0 | 1 | 0 | 2 | pandas |
3 | 0 | 1 | 1 | 3 | pandas,python |
4 | 1 | 0 | 0 | 4 | matplotlib |
5 | 1 | 0 | 1 | 5 | matplotlib,python |
6 | 1 | 1 | 0 | 6 | matplotlib,pandas |
7 | 1 | 1 | 1 | 7 | matplotlib,pandas,python |
Now that we have an enumeration of all possible tags and their combinations (backed
by numbers 0:2**N
), we can represent this information as a pandas.CategoricalDtype
.
from pandas import CategoricalDtype
FlagDtype = CategoricalDtype(categories=df['human'])
FlagDtype
CategoricalDtype(categories=['', 'python', 'pandas', 'pandas,python', 'matplotlib',
'matplotlib,python', 'matplotlib,pandas',
'matplotlib,pandas,python'],
, ordered=False, categories_dtype=object)
You’re probably thinking that this is a lot of work just to create a pandas.CategoricalDtype
.
But, let’s take this to an applied example to see how representing tag-sets with
as a pandas.CategoricalDtype
can vastly improve our memory footprint and lookup speed compared to operating
along equivalent object sets (tuples or strings).
Applied Example#
Let’s start with some string data representing various tags that are associated with an arbitrary number of records/entities.
from enum import Flag
from pandas import Series
from numpy.random import default_rng
from numpy.ma import masked_array
rng = default_rng(1)
Tags = Flag('Tags', 'python pandas matplotlib')
tag_pool = [t.name for t in Tags]
# starting from the human representation
tag_series = Series([
','.join(
sorted(
rng.choice(
tag_pool,
size=rng.integers(len(Tags)),
replace=False
)
)
)
for _ in range(100_000)
])
tag_series
0 pandas
1 pandas,python
2 pandas,python
3 matplotlib,python
4
...
99995 matplotlib
99996 pandas
99997 matplotlib
99998
99999 matplotlib
Length: 100000, dtype: object
The first thing we need to do is transform these strings into a sparse representation
of the entities they contain. In pandas, we can readily do this with Series.str.get_dummies
(remember when I said
this pattern is a 1:1 match with multi-value dummy coding?).
bits = tag_series.str.get_dummies(sep=',').astype('uint8')
bits.head()
matplotlib | pandas | python | |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 1 |
2 | 0 | 1 | 1 |
3 | 1 | 0 | 1 |
4 | 0 | 0 | 0 |
Now we can readily create the combinations of all categories by using the information contained entirely with the column of the above DataFrame.
from enum import Flag
from pandas import DataFrame, Categorical, CategoricalDtype
from numpy import arange
# Create bitcode for all possible tag combinations
perm_bitcodes = bitwise_permutations(bits.columns.size)
perms_names = [','.join(bits.columns[row]) for row in perm_bitcodes.astype(bool)]
FlagDtype = CategoricalDtype(categories=perms_names)
# Create integers from the observed bits
intcodes = bits @ (2 ** arange(bits.columns.size)[::-1])
result = bits.assign(
raw=tag_series,
encoded=Categorical.from_codes(intcodes, dtype=FlagDtype),
)
display(
result,
result
.dtypes.to_frame('dtypes')
.assign(memory=result.memory_usage(deep=True))
)
matplotlib | pandas | python | raw | encoded | |
---|---|---|---|---|---|
0 | 0 | 1 | 0 | pandas | pandas |
1 | 0 | 1 | 1 | pandas,python | pandas,python |
2 | 0 | 1 | 1 | pandas,python | pandas,python |
3 | 1 | 0 | 1 | matplotlib,python | matplotlib,python |
4 | 0 | 0 | 0 | ||
... | ... | ... | ... | ... | ... |
99995 | 1 | 0 | 0 | matplotlib | matplotlib |
99996 | 0 | 1 | 0 | pandas | pandas |
99997 | 1 | 0 | 0 | matplotlib | matplotlib |
99998 | 0 | 0 | 0 | ||
99999 | 1 | 0 | 0 | matplotlib | matplotlib |
100000 rows × 5 columns
dtypes | memory | |
---|---|---|
matplotlib | uint8 | 100000 |
pandas | uint8 | 100000 |
python | uint8 | 100000 |
raw | object | 6465503 |
encoded | category | 100849 |
Since we are using a pandas.Categorical
to hold this information, we would expect to see
large benefits to our memory footprint as well as comparative operations. Furthermore,
by using the underlying bit patterns, we can readily perform set-like comparisons
(but I’ll save this idea for a different blog post).
%timeit -n1 -r1 result['raw'] == 'pandas,python'
%timeit -n1 -r1 result['encoded'] == 'pandas,python'
# both result sets are equivalent
assert (
((result['raw'] == 'pandas,python') == (result['encoded'] == 'pandas,python'))
.all()
)
5.04 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
249 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
To-Do#
While we still have a ways to go to create an ergonomic enum.Flag
Categorical
datatype in pandas, we have the core pieces of logic. The next steps would involve
working with intermediated values within a pandas.array
, similar to how datetime
arrays are full of pandas.Timestamps:
from pandas import array, to_datetime
arr = array(to_datetime(['2000-01-01', '2000-01-02']))
print(
arr,
f"{arr[0] = }",
sep='\n{}\n'.format(f'\N{box drawings light horizontal}' * 40)
)
<DatetimeArray>
['2000-01-01 00:00:00', '2000-01-02 00:00:00']
Length: 2, dtype: datetime64[ns]
────────────────────────────────────────
arr[0] = Timestamp('2000-01-01 00:00:00')
By using a pandas.api.extensions.ExtensionArray
alongside an intermediated value, we can set up ergonomic
queries for various setlike operations such as checking exact equality or subsets/supersets.
Wrap-Up#
That was a lot of pandas! While we solved the computational part of this problem, there is still a lot more to implement. Of course, we could also further decompose this into a separate table with a 1:m relationship which would have a favorable trade-off if you exceed 64 unique tags (as this would exceed the maximum size of our encoding here), but we would also see a performance trade-off before reaching 64 unique entities.
What did you think about my solution? Let me know on the DUTC Discord. Talk to you all next week!