pandas: Months, Days, and Categoricals#
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss
how we can leverage pandas.Categorical
arrays when working with calendar
months and weekdays. This is a bit of a longstanding issue I’ve
had with pandas. However, I am not the only one who has thought of this, so I have to respect the priorities of the core developers who contribute
their time to this project.
Premise#
We have some dates stored in a pandas.Series
, and, at some point during our
analytical pipeline, we need to work with individual months and/or weekdays.
We can readily extract the integer value that corresponds with each month
(where January ⇒ 1, December ⇒ 12), OR we can extract the string name of the
month. The same transformations are available for the day of the week(where
Monday ⇒ 0, Sunday ⇒ 6).
from pandas import DataFrame, date_range
df = DataFrame({
'date': [
*date_range('1980-01-01', periods=1_000_000, freq='min'),
]
})
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.month_name()
df['weekday'] = df['date'].dt.weekday
df['day_name'] = df['date'].dt.day_name()
df
date | month | month_name | weekday | day_name | |
---|---|---|---|---|---|
0 | 1980-01-01 00:00:00 | 1 | January | 1 | Tuesday |
1 | 1980-01-01 00:01:00 | 1 | January | 1 | Tuesday |
2 | 1980-01-01 00:02:00 | 1 | January | 1 | Tuesday |
3 | 1980-01-01 00:03:00 | 1 | January | 1 | Tuesday |
4 | 1980-01-01 00:04:00 | 1 | January | 1 | Tuesday |
... | ... | ... | ... | ... | ... |
999995 | 1981-11-25 10:35:00 | 11 | November | 2 | Wednesday |
999996 | 1981-11-25 10:36:00 | 11 | November | 2 | Wednesday |
999997 | 1981-11-25 10:37:00 | 11 | November | 2 | Wednesday |
999998 | 1981-11-25 10:38:00 | 11 | November | 2 | Wednesday |
999999 | 1981-11-25 10:39:00 | 11 | November | 2 | Wednesday |
1000000 rows × 5 columns
The problem that we encounter with this decision is that we need to either use the weekday/months as integers (thus requiring an interpretive layer to make sense of the values), or we can opt to work with the strings, which will result in much slower comparative operations and a larger memory footprint.
print(
df.dtypes,
df.memory_usage(deep=True),
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
date datetime64[ns]
month int32
month_name object
weekday int32
day_name object
dtype: object
────────────────────────────────────────
Index 128
date 8000000
month 4000000
month_name 63053120
weekday 4000000
day_name 64143840
dtype: int64
%timeit -n 1 -r 1 df['month'] == 1 # less readable, fast
%timeit -n 1 -r 1 df['month_name'] == 'January' # more readable, slow
603 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
45.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
But, the fact is that we don’t really need to make this decision. We can have the best of both worlds because…
both the names of weekdays and the months in a year are finitely enumerable
we can rely on numeric values for comparative operations and extract the string values when necessary.
As it turns out, the pandas.Categorical
is designed to solve this exact problem,
so we should have one returned from Series.dt.month_name
and Series.dt.day_name
A brief aside on Categoricals (in pandas)#
Categoricals are incredibly useful when you have string data that contain a large number of duplicates. The fewer duplicate strings you have, the less useful Categoricals will be for you.
If we have a limited number of strings that we want to represent in some data, we can enumerate those strings. Then, from that enumeration, we can use integers to identify each of those unique strings.
from pandas import Series
s = Series(['a', 'b', 'c', 'a', 'a', 'b'])
codes, categories = s.factorize()
print(
f'{categories = }', # unique strings from `s`
f'{codes = }', # integers that positionally map catgories back to `s`
sep='\n'
)
categories = Index(['a', 'b', 'c'], dtype='object')
codes = array([0, 1, 2, 0, 0, 1])
These two pieces of information compose a Categorical, and we can use
pandas.Categorical
to create one of these special arrays from an input:
from pandas import Categorical
# Categorical takes in an array and factorizes it for the unique values and their
# locations in the inputted array.
Categorical(['a', 'a', 'b', 'c', 'd'])
['a', 'a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
However, we often have data split across many sources, and we can run into problems
where a given source (whether we read a single .csv
or query a table in SQL)
may not have ALL of the unique categories. Therefore, we need to also have a way
that we can declare what all of the possible categories are for a given pandas.Categorical
.
# we can pass `categories` into the Categorical constructor
Categorical(['a', 'a', 'b', 'c'], categories=['a', 'b', 'c', 'd'])
['a', 'a', 'b', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']
While the above is useful for one-off transformations, we may want to track the
unique categories in a more portable manner. This is the pandas.CategoricalDtype
.
With the CategoricalDtype
, we can also store some metadata alongside
the categories themselves (e.g., if they should be ordered or not).
from pandas import CategoricalDtype
# we can also enumerate our expected categories
# any value that is unspecified is replaced with a NaN
dtype = CategoricalDtype(['a', 'b', 'c'], ordered=True)
cat_arr = Categorical(['a', 'a', 'b', 'c', 'd'], dtype=dtype)
cat_arr
['a', 'a', 'b', 'c', NaN]
Categories (3, object): ['a' < 'b' < 'c']
# 0, 1, 2 reference the positions of the unique values
# -1 represents a NaN/missing value
print(
f'{cat_arr.codes = }',
f'{cat_arr.categories = }',
sep='\n'
)
cat_arr.codes = array([ 0, 0, 1, 2, -1], dtype=int8)
cat_arr.categories = Index(['a', 'b', 'c'], dtype='object')
Now, we’ve seen how to turn an array of strings into a Categorical, but all of the examples require us to have some type of string data in the first place. So, if we have a memory concern, we need to work with the full strings at some point in our pipelines.
Fortunately, there is a straightforward way for us to construct a pandas.Categorical
from some provided codes via the pandas.Categorical.from_codes
constructor.
codes = [0, 0, 1, 2, -1]
Categorical.from_codes(codes, dtype=dtype)
['a', 'a', 'b', 'c', NaN]
Categories (3, object): ['a' < 'b' < 'c']
This trick is going to be very useful for our application to months and days,
as it means that we don’t need to go through a temporary casting to strings
in order to construct a pandas.Categorical
.
Turning Months & Days into Categories#
Let’s go ahead and create some datatypes to represent our months and weekdays:
from calendar import day_name, month_name
WeekDayDtype = CategoricalDtype(day_name[:], ordered=True)
MonthDtype = CategoricalDtype(month_name[1:], ordered=True)
print(
f'{WeekDayDtype = !r}',
f'{MonthDtype = !r}',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
WeekDayDtype = CategoricalDtype(categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
'Saturday', 'Sunday'],
, ordered=True, categories_dtype=object)
────────────────────────────────────────
MonthDtype = CategoricalDtype(categories=['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November',
'December'],
, ordered=True, categories_dtype=object)
Now we can grab the numeric codes from our dataframe. Remember that we stored
these codes in the 'weekday'
and the 'month'
columns. One small caveat here
is that our weekday column is 0-indexed (0 denotes Monday), whereas the
'month'
column is 1-indexed (0 is meaningless, 1 denotes January). We can use
a small correction to 0-index both of these arrays to align with what our Categorical
expects.
df['day_name_cat'] = Categorical.from_codes(df['weekday'], dtype=WeekDayDtype)
df['month_name_cat'] = Categorical.from_codes(df['month'] - 1, dtype=MonthDtype)
# ensure the Categoricals and their Object counterparts have the same values
assert (df['day_name'] == df['day_name_cat']).all()
assert (df['month_name'] == df['month_name_cat']).all()
df
date | month | month_name | weekday | day_name | day_name_cat | month_name_cat | |
---|---|---|---|---|---|---|---|
0 | 1980-01-01 00:00:00 | 1 | January | 1 | Tuesday | Tuesday | January |
1 | 1980-01-01 00:01:00 | 1 | January | 1 | Tuesday | Tuesday | January |
2 | 1980-01-01 00:02:00 | 1 | January | 1 | Tuesday | Tuesday | January |
3 | 1980-01-01 00:03:00 | 1 | January | 1 | Tuesday | Tuesday | January |
4 | 1980-01-01 00:04:00 | 1 | January | 1 | Tuesday | Tuesday | January |
... | ... | ... | ... | ... | ... | ... | ... |
999995 | 1981-11-25 10:35:00 | 11 | November | 2 | Wednesday | Wednesday | November |
999996 | 1981-11-25 10:36:00 | 11 | November | 2 | Wednesday | Wednesday | November |
999997 | 1981-11-25 10:37:00 | 11 | November | 2 | Wednesday | Wednesday | November |
999998 | 1981-11-25 10:38:00 | 11 | November | 2 | Wednesday | Wednesday | November |
999999 | 1981-11-25 10:39:00 | 11 | November | 2 | Wednesday | Wednesday | November |
1000000 rows × 7 columns
With this simple transformation, we have vastly reduced the footprint of our dataframe and can still interact with these entities by name, making our code both more readable and more efficient. Let’s take one final look at the search speed and the memory usage:
%timeit -n 1 -r 1 df['month'] == 1 # less readable, fast
%timeit -n 1 -r 1 df['month_name'] == 'January' # more readable, slow
%timeit -n 1 -r 1 df['month_name_cat'] == 'January' # more readable, fast!
790 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
44.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.05 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(
df.dtypes,
df.memory_usage(deep=True).sort_index(),
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
date datetime64[ns]
month int32
month_name object
weekday int32
day_name object
day_name_cat category
month_name_cat category
dtype: object
────────────────────────────────────────
Index 128
date 8000000
day_name 64143840
day_name_cat 1000749
month 4000000
month_name 63053120
month_name_cat 1001058
weekday 4000000
dtype: int64
Wrap-Up#
Thanks for checking out my blog post this week! I hope you enjoyed this brief
overview of pandas.Categoricals
and its practical uses.
What do you think about pandas.Categoricals
? How do you think you’ll implement it in your code? Let me know on the DUTC Discord server.
Talk to you all next week!