Dealing With Dates in Pandas - Part 1#
So how do we work with dates and times in pandas
? Well if we need to ensure our
operations are as performant as possible we’ll need to reach into
pandas
restricted computation domain, and that means using its objects and
playing by its rules.
Fortunately, the metaphors we’ve discussed about date times along the way still hold
point-in-time: a representation of a specific moment
span-of-time: a representation of a moment with a duration
In pandas
we use the pandas.Timestamp
and pandas.Period
to represent our
metaphors respectively.
Why pandas.Timestamp
and not datetime.datetime
?#
Earlier I mentioned that pandas
is a restricted computation domain, which
means that we can write high level Python code, which leverages other
optimized routines to perform our actual calculations.
One of the optimizations that pandas
makes when handling datetimes
is to
represent all datetimes
as a single integer that is the number of nanoseconds
offset from the epoch (1970-01-01 00:00:00).
from pandas import Timestamp
Timestamp(0) # create a Timestamp object represented by the integer 0
Timestamp('1970-01-01 00:00:00')
print( # pandas can guess the format of *most* datetimes
Timestamp('2023-01-18'),
Timestamp('01/18/23'),
sep='\n'
)
2023-01-18 00:00:00
2023-01-18 00:00:00
Since the underlying representation of time is an integer, pandas
uses an
array of integer values to represent these times. By using an underlying integer
array, pandas
can quickly apply mathematical operations to our datetimes
.
Datetime delta in pandas#
Similar to what we saw in Pythons datetime.timedelta
pandas has an equivalent
pandas.Timedelta
object that is used for performing arithmetic with dates.
from pandas import Timedelta
start = Timestamp("2023-01-18")
delta = Timedelta(4, unit='day') # 4 day delta
print(
f"{start = }",
f"{delta = }",
f"{start + delta = }",
sep='\n',
)
start = Timestamp('2023-01-18 00:00:00')
delta = Timedelta('4 days 00:00:00')
start + delta = Timestamp('2023-01-22 00:00:00')
However we are often faced with real-world constraints when working with datetimes. For example, when working with datetimes that encompass a daylight savings transition, if I want to increment 1 day beyond a date where a timezone transition- should I end up 24 hours after I started? Or should I simple increase the day part of my date by 1?
Consider the following, knowing that on Nov 06, 2022 the United States rolled their clocks back 1 hour at 02:00am local time.
from pandas import Timestamp, Timedelta, DateOffset
ts = Timestamp("2022-11-06", tz="US/Pacific")
print(
f"{ts + Timedelta(days=1) = }", # increments time by literal 24 hours
f"{ts + DateOffset(days=1) = }", # increments day portion of date by 1
sep="\n"
)
ts + Timedelta(days=1) = Timestamp('2022-11-06 23:00:00-0800', tz='US/Pacific')
ts + DateOffset(days=1) = Timestamp('2022-11-07 00:00:00-0800', tz='US/Pacific')
The DateOffset
is how pandas
implements calendar arithmetic, and helps us account
for changes in local time zones. Importantly, the idea of calendar arithmetic
helps us solve real-world problems. If instead of “adding 7 days” to our start date,
what if we wanted to add “7 business days” (only counting Monday-Friday). The
Timedelta
is not flexible enough to perform this type of logic, instead we reach for
a BusinessDay
offset (which is built on top of the DateOffset
class.
from pandas import Timestamp, Timedelta
from pandas.tseries.offsets import BusinessDay
ts = Timestamp("2022-11-03")
print(
f"{ts =: %A, %Y-%m-%d}",
# Adding 2 days to Thursday get us to Saturday
f"{ts + Timedelta(days=2) =: %A, %Y-%m-%d}",
# Saturday is not a valid Businessday, roll to following Monday
f"{ts + BusinessDay(2) =: %A, %Y-%m-%d }",
sep="\n"
)
ts = Thursday, 2022-11-03
ts + Timedelta(days=2) = Saturday, 2022-11-05
ts + BusinessDay(2) = Monday, 2022-11-07
In the cell above we’ve demonstrated a powerful idea of DateOffset
s, the idea of
a week mask (which days of the week count arithmetically) and rolling forward.
Rolling forward occurs when we perform our date arithmetic and we end on an invalid date
(such as a Saturday when we are only considering Business Days).
The week mask for the BusinessDay
is implicitly locked to Monday-Friday, however
pandas
does expose some flexibility for implementing your own BusinessDay
logic
with the CustomBusinessDay
object.
from pandas import Timestamp, Timedelta
from pandas.tseries.offsets import BusinessDay, CustomBusinessDay
ts = Timestamp("2022-11-03")
weekmask = "Tue Wed Thu Fri Sat Sun" # work every day except for Monday
print(
f"{ts =: %A, %Y-%m-%d}",
# Adding 4 days to Thursday get us to Monday
f"{ts + Timedelta(days=4) =: %A, %Y-%m-%d}",
# Not counting Saturday or Sunday, we end up on Wednesday
f"{ts + BusinessDay(4) =: %A, %Y-%m-%d }",
# Not counting Monday, we end up on Tuesday with this weekmask
f"{ts + CustomBusinessDay(4, weekmask=weekmask) =: %A, %Y-%m-%d }",
sep="\n"
)
ts = Thursday, 2022-11-03
ts + Timedelta(days=4) = Monday, 2022-11-07
ts + BusinessDay(4) = Wednesday, 2022-11-09
ts + CustomBusinessDay(4, weekmask=weekmask) = Tuesday, 2022-11-08
Wrap Up#
There you have it! We covered some of the date time functionality that pandas
offers, some implementation details, and most importantly the broader concepts
of points vs spans of time and datetime arithmetic vs calendar aritmetic.
Pandas has a lot more functionality to offer when covering its date time support,
so you can expect to hear more on pandas
and datetimes next week! Talk to you
all again then.