pandas: Months, Days, and Categoricals#

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss how we can leverage pandas.Categorical arrays when working with calendar months and weekdays. This is a bit of a longstanding issue I’ve had with pandas. However, I am not the only one who has thought of this, so I have to respect the priorities of the core developers who contribute their time to this project.

Premise#

We have some dates stored in a pandas.Series, and, at some point during our analytical pipeline, we need to work with individual months and/or weekdays. We can readily extract the integer value that corresponds with each month (where January ⇒ 1, December ⇒ 12), OR we can extract the string name of the month. The same transformations are available for the day of the week(where Monday ⇒ 0, Sunday ⇒ 6).

from pandas import DataFrame, date_range

df = DataFrame({
    'date': [
        *date_range('1980-01-01', periods=1_000_000, freq='min'),
    ]
})

df['month']      = df['date'].dt.month
df['month_name'] = df['date'].dt.month_name()
df['weekday']    = df['date'].dt.weekday
df['day_name']   = df['date'].dt.day_name()


df
date month month_name weekday day_name
0 1980-01-01 00:00:00 1 January 1 Tuesday
1 1980-01-01 00:01:00 1 January 1 Tuesday
2 1980-01-01 00:02:00 1 January 1 Tuesday
3 1980-01-01 00:03:00 1 January 1 Tuesday
4 1980-01-01 00:04:00 1 January 1 Tuesday
... ... ... ... ... ...
999995 1981-11-25 10:35:00 11 November 2 Wednesday
999996 1981-11-25 10:36:00 11 November 2 Wednesday
999997 1981-11-25 10:37:00 11 November 2 Wednesday
999998 1981-11-25 10:38:00 11 November 2 Wednesday
999999 1981-11-25 10:39:00 11 November 2 Wednesday

1000000 rows × 5 columns

The problem that we encounter with this decision is that we need to either use the weekday/months as integers (thus requiring an interpretive layer to make sense of the values), or we can opt to work with the strings, which will result in much slower comparative operations and a larger memory footprint.

print(
    df.dtypes,
    df.memory_usage(deep=True),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
date          datetime64[ns]
month                  int32
month_name            object
weekday                int32
day_name              object
dtype: object
────────────────────────────────────────
Index              128
date           8000000
month          4000000
month_name    63053120
weekday        4000000
day_name      64143840
dtype: int64
%timeit -n 1 -r 1 df['month'] == 1              # less readable, fast
%timeit -n 1 -r 1 df['month_name'] == 'January' # more readable, slow
603 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
45.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

But, the fact is that we don’t really need to make this decision. We can have the best of both worlds because…

  1. both the names of weekdays and the months in a year are finitely enumerable

  2. we can rely on numeric values for comparative operations and extract the string values when necessary.

As it turns out, the pandas.Categorical is designed to solve this exact problem, so we should have one returned from Series.dt.month_name and Series.dt.day_name

A brief aside on Categoricals (in pandas)#

Categoricals are incredibly useful when you have string data that contain a large number of duplicates. The fewer duplicate strings you have, the less useful Categoricals will be for you.

If we have a limited number of strings that we want to represent in some data, we can enumerate those strings. Then, from that enumeration, we can use integers to identify each of those unique strings.

from pandas import Series

s = Series(['a', 'b', 'c', 'a', 'a', 'b'])
codes, categories = s.factorize()

print(
    f'{categories = }', # unique strings from `s`
    f'{codes      = }', # integers that positionally map catgories back to `s`
    sep='\n'
)
categories = Index(['a', 'b', 'c'], dtype='object')
codes      = array([0, 1, 2, 0, 0, 1])

These two pieces of information compose a Categorical, and we can use pandas.Categorical to create one of these special arrays from an input:

from pandas import Categorical

# Categorical takes in an array and factorizes it for the unique values and their
#   locations in the inputted array.
Categorical(['a', 'a', 'b', 'c', 'd'])
['a', 'a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

However, we often have data split across many sources, and we can run into problems where a given source (whether we read a single .csv or query a table in SQL) may not have ALL of the unique categories. Therefore, we need to also have a way that we can declare what all of the possible categories are for a given pandas.Categorical.

# we can pass `categories` into the Categorical constructor
Categorical(['a', 'a', 'b', 'c'], categories=['a', 'b', 'c', 'd'])
['a', 'a', 'b', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']

While the above is useful for one-off transformations, we may want to track the unique categories in a more portable manner. This is the pandas.CategoricalDtype. With the CategoricalDtype, we can also store some metadata alongside the categories themselves (e.g., if they should be ordered or not).

from pandas import CategoricalDtype

# we can also enumerate our expected categories
#   any value that is unspecified is replaced with a NaN
dtype = CategoricalDtype(['a', 'b', 'c'], ordered=True)
cat_arr = Categorical(['a', 'a', 'b', 'c', 'd'], dtype=dtype)
cat_arr
['a', 'a', 'b', 'c', NaN]
Categories (3, object): ['a' < 'b' < 'c']
# 0, 1, 2 reference the positions of the unique values
# -1 represents a NaN/missing value

print(
    f'{cat_arr.codes      = }',
    f'{cat_arr.categories = }',
    sep='\n'
)
cat_arr.codes      = array([ 0,  0,  1,  2, -1], dtype=int8)
cat_arr.categories = Index(['a', 'b', 'c'], dtype='object')

Now, we’ve seen how to turn an array of strings into a Categorical, but all of the examples require us to have some type of string data in the first place. So, if we have a memory concern, we need to work with the full strings at some point in our pipelines.

Fortunately, there is a straightforward way for us to construct a pandas.Categorical from some provided codes via the pandas.Categorical.from_codes constructor.

codes = [0, 0, 1, 2, -1]
Categorical.from_codes(codes, dtype=dtype)
['a', 'a', 'b', 'c', NaN]
Categories (3, object): ['a' < 'b' < 'c']

This trick is going to be very useful for our application to months and days, as it means that we don’t need to go through a temporary casting to strings in order to construct a pandas.Categorical.

Turning Months & Days into Categories#

Let’s go ahead and create some datatypes to represent our months and weekdays:

from calendar import day_name, month_name
WeekDayDtype = CategoricalDtype(day_name[:], ordered=True)
MonthDtype   = CategoricalDtype(month_name[1:], ordered=True)

print(
    f'{WeekDayDtype = !r}',
    f'{MonthDtype   = !r}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
WeekDayDtype = CategoricalDtype(categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
                  'Saturday', 'Sunday'],
, ordered=True, categories_dtype=object)
────────────────────────────────────────
MonthDtype   = CategoricalDtype(categories=['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November',
                  'December'],
, ordered=True, categories_dtype=object)

Now we can grab the numeric codes from our dataframe. Remember that we stored these codes in the 'weekday' and the 'month' columns. One small caveat here is that our weekday column is 0-indexed (0 denotes Monday), whereas the 'month' column is 1-indexed (0 is meaningless, 1 denotes January). We can use a small correction to 0-index both of these arrays to align with what our Categorical expects.

df['day_name_cat'] = Categorical.from_codes(df['weekday'], dtype=WeekDayDtype)
df['month_name_cat'] = Categorical.from_codes(df['month'] - 1, dtype=MonthDtype)

# ensure the Categoricals and their Object counterparts have the same values
assert (df['day_name']   == df['day_name_cat']).all()
assert (df['month_name'] == df['month_name_cat']).all()

df
date month month_name weekday day_name day_name_cat month_name_cat
0 1980-01-01 00:00:00 1 January 1 Tuesday Tuesday January
1 1980-01-01 00:01:00 1 January 1 Tuesday Tuesday January
2 1980-01-01 00:02:00 1 January 1 Tuesday Tuesday January
3 1980-01-01 00:03:00 1 January 1 Tuesday Tuesday January
4 1980-01-01 00:04:00 1 January 1 Tuesday Tuesday January
... ... ... ... ... ... ... ...
999995 1981-11-25 10:35:00 11 November 2 Wednesday Wednesday November
999996 1981-11-25 10:36:00 11 November 2 Wednesday Wednesday November
999997 1981-11-25 10:37:00 11 November 2 Wednesday Wednesday November
999998 1981-11-25 10:38:00 11 November 2 Wednesday Wednesday November
999999 1981-11-25 10:39:00 11 November 2 Wednesday Wednesday November

1000000 rows × 7 columns

With this simple transformation, we have vastly reduced the footprint of our dataframe and can still interact with these entities by name, making our code both more readable and more efficient. Let’s take one final look at the search speed and the memory usage:

%timeit -n 1 -r 1 df['month'] == 1                  # less readable, fast
%timeit -n 1 -r 1 df['month_name'] == 'January'     # more readable, slow
%timeit -n 1 -r 1 df['month_name_cat'] == 'January' # more readable, fast!
790 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
44.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.05 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(
    df.dtypes,
    df.memory_usage(deep=True).sort_index(),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
date              datetime64[ns]
month                      int32
month_name                object
weekday                    int32
day_name                  object
day_name_cat            category
month_name_cat          category
dtype: object
────────────────────────────────────────
Index                  128
date               8000000
day_name          64143840
day_name_cat       1000749
month              4000000
month_name        63053120
month_name_cat     1001058
weekday            4000000
dtype: int64

Wrap-Up#

Thanks for checking out my blog post this week! I hope you enjoyed this brief overview of pandas.Categoricals and its practical uses.

What do you think about pandas.Categoricals? How do you think you’ll implement it in your code? Let me know on the DUTC Discord server.

Talk to you all next week!