Dealing With Dates in Pandas - Part 1#

So how do we work with dates and times in pandas? Well if we need to ensure our operations are as performant as possible we’ll need to reach into pandas restricted computation domain, and that means using its objects and playing by its rules.

Fortunately, the metaphors we’ve discussed about date times along the way still hold

  • point-in-time: a representation of a specific moment

  • span-of-time: a representation of a moment with a duration

In pandas we use the pandas.Timestamp and pandas.Period to represent our metaphors respectively.

Why pandas.Timestamp and not datetime.datetime?#

Earlier I mentioned that pandas is a restricted computation domain, which means that we can write high level Python code, which leverages other optimized routines to perform our actual calculations.

One of the optimizations that pandas makes when handling datetimes is to represent all datetimes as a single integer that is the number of nanoseconds offset from the epoch (1970-01-01 00:00:00).

from pandas import Timestamp

Timestamp(0) # create a Timestamp object represented by the integer 0
Timestamp('1970-01-01 00:00:00')
print(  # pandas can guess the format of *most* datetimes
    Timestamp('2023-01-18'),
    Timestamp('01/18/23'),
    sep='\n'
)
2023-01-18 00:00:00
2023-01-18 00:00:00

Since the underlying representation of time is an integer, pandas uses an array of integer values to represent these times. By using an underlying integer array, pandas can quickly apply mathematical operations to our datetimes.

Datetime delta in pandas#

Similar to what we saw in Pythons datetime.timedelta pandas has an equivalent pandas.Timedelta object that is used for performing arithmetic with dates.

from pandas import Timedelta

start = Timestamp("2023-01-18")
delta = Timedelta(4, unit='day') # 4 day delta

print(
    f"{start         = }",
    f"{delta         = }",
    f"{start + delta = }",
    sep='\n',
)
start         = Timestamp('2023-01-18 00:00:00')
delta         = Timedelta('4 days 00:00:00')
start + delta = Timestamp('2023-01-22 00:00:00')

However we are often faced with real-world constraints when working with datetimes. For example, when working with datetimes that encompass a daylight savings transition, if I want to increment 1 day beyond a date where a timezone transition- should I end up 24 hours after I started? Or should I simple increase the day part of my date by 1?

Consider the following, knowing that on Nov 06, 2022 the United States rolled their clocks back 1 hour at 02:00am local time.

from pandas import Timestamp, Timedelta, DateOffset
ts = Timestamp("2022-11-06", tz="US/Pacific")

print(
    f"{ts + Timedelta(days=1) = }", # increments time by literal 24 hours
    f"{ts + DateOffset(days=1) = }", # increments day portion of date by 1
    sep="\n"
)
ts + Timedelta(days=1) = Timestamp('2022-11-06 23:00:00-0800', tz='US/Pacific')
ts + DateOffset(days=1) = Timestamp('2022-11-07 00:00:00-0800', tz='US/Pacific')

The DateOffset is how pandas implements calendar arithmetic, and helps us account for changes in local time zones. Importantly, the idea of calendar arithmetic helps us solve real-world problems. If instead of “adding 7 days” to our start date, what if we wanted to add “7 business days” (only counting Monday-Friday). The Timedelta is not flexible enough to perform this type of logic, instead we reach for a BusinessDay offset (which is built on top of the DateOffset class.

from pandas import Timestamp, Timedelta
from pandas.tseries.offsets import BusinessDay
ts = Timestamp("2022-11-03")

print(
    f"{ts                     =: %A, %Y-%m-%d}",
    
    # Adding 2 days to Thursday get us to Saturday
    f"{ts + Timedelta(days=2) =: %A, %Y-%m-%d}",
    
    # Saturday is not a valid Businessday, roll to following Monday
    f"{ts + BusinessDay(2)    =: %A, %Y-%m-%d }", 
    sep="\n"
)
ts                     = Thursday, 2022-11-03
ts + Timedelta(days=2) = Saturday, 2022-11-05
ts + BusinessDay(2)    = Monday, 2022-11-07 

In the cell above we’ve demonstrated a powerful idea of DateOffsets, the idea of a week mask (which days of the week count arithmetically) and rolling forward. Rolling forward occurs when we perform our date arithmetic and we end on an invalid date (such as a Saturday when we are only considering Business Days).

The week mask for the BusinessDay is implicitly locked to Monday-Friday, however pandas does expose some flexibility for implementing your own BusinessDay logic with the CustomBusinessDay object.

from pandas import Timestamp, Timedelta
from pandas.tseries.offsets import BusinessDay, CustomBusinessDay

ts = Timestamp("2022-11-03")
weekmask = "Tue Wed Thu Fri Sat Sun" # work every day except for Monday

print( 
    f"{ts                                           =: %A, %Y-%m-%d}",
    
    # Adding 4 days to Thursday get us to Monday
    f"{ts + Timedelta(days=4)                       =: %A, %Y-%m-%d}",
    
    # Not counting Saturday or Sunday, we end up on Wednesday
    f"{ts + BusinessDay(4)                          =: %A, %Y-%m-%d }", 
    
    # Not counting Monday, we end up on Tuesday with this weekmask
    f"{ts + CustomBusinessDay(4, weekmask=weekmask) =: %A, %Y-%m-%d }", 
    sep="\n"
)
ts                                           = Thursday, 2022-11-03
ts + Timedelta(days=4)                       = Monday, 2022-11-07
ts + BusinessDay(4)                          = Wednesday, 2022-11-09 
ts + CustomBusinessDay(4, weekmask=weekmask) = Tuesday, 2022-11-08 

Wrap Up#

There you have it! We covered some of the date time functionality that pandas offers, some implementation details, and most importantly the broader concepts of points vs spans of time and datetime arithmetic vs calendar aritmetic.

Pandas has a lot more functionality to offer when covering its date time support, so you can expect to hear more on pandas and datetimes next week! Talk to you all again then.