Dealing With Dates in Pandas - Part 2
In my previous post, we discussed how we can approach date times in pandas, as well as the metaphors used by the library and the differences between absolute time and calendar time (also referred to as relative time). This week, we’ll dive a little deeper into the functionality that pandas has to offer when dealing with time series data, covering topics like:
parsing date times
date time-aware selections
aligning date time data from different frequencies
Parsing date times
Text-based file formats are one of the most common forms of data storage. Unfortunately, this means that our dates are also stored as text rather than in a native numeric representation. In order to work with our dates effectively, we need to parse them back into their appropriate pandas representation.
from pandas import read_fwf
from io import StringIO
buf = StringIO('''
date data
2023-jan-01 58
2023-mar-25 59
2023-nov-18 60
'''.strip('\n'))
df = read_fwf(buf)
df.dtypes # note the object dtype, which hints that this is a column of strings
date object
data int64
dtype: object
from pandas import to_datetime
print(
to_datetime(df['date']), # pandas is very good at guessing datetime format
to_datetime(df['date'], format='%Y-%b-%d'), # though we can pass the format ourselves
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
0 2023-01-01
1 2023-03-25
2 2023-11-18
Name: date, dtype: datetime64[ns]
────────────────────────────────────────
0 2023-01-01
1 2023-03-25
2 2023-11-18
Name: date, dtype: datetime64[ns]
/tmp/ipykernel_28852/856219756.py:4: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
to_datetime(df['date']), # pandas is very good at guessing datetime format
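Real-world date columns are often messier than this example. As a small sketch (using a made-up column containing one unparseable entry), the errors='coerce' option of to_datetime turns bad values into NaT instead of raising an exception, which is handy when cleaning scraped or hand-entered data:

```python
from pandas import Series, to_datetime

# a hypothetical messy column: one entry cannot be parsed as a date
raw = Series(['2023-jan-01', 'not-a-date', '2023-nov-18'])

# errors='coerce' replaces unparseable values with NaT rather than raising
parsed = to_datetime(raw, format='%Y-%b-%d', errors='coerce')
print(parsed)
```

The resulting NaT values can then be located with .isna() and dropped or filled like any other missing data.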
Date time-aware selections
The best part about working with pandas is the power of the Index. So it should come as no surprise that working with a Series or DataFrame that has a DatetimeIndex is also a convenient experience (if you use the .index appropriately).
from pandas import Series, date_range
from numpy import arange
s = Series(
index=date_range('2023-01-05', periods=100, freq='5h'),
data=arange(100),
name='signal 1',
)
# .loc implicitly converts strings to date times and selects data accordingly
s.loc['2023-01-10']
2023-01-10 00:00:00 24
2023-01-10 05:00:00 25
2023-01-10 10:00:00 26
2023-01-10 15:00:00 27
2023-01-10 20:00:00 28
Freq: 5h, Name: signal 1, dtype: int64
By specifying a date time selection of ‘2023-01-10’, we are able to index all of the date times that fall under that date. So, by being less specific, we can select broad swaths of dates.
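This partial string indexing works at any granularity. As a quick sketch with a small made-up series that crosses a month boundary, a month-level string selects every row that falls in that month:

```python
from pandas import Series, date_range
from numpy import arange

# a hypothetical series straddling the January/February boundary
s = Series(
    index=date_range('2023-01-28', periods=10, freq='2D'),
    data=arange(10),
)

# a month-level partial string selects only the rows in that month
print(s.loc['2023-01'])  # the two January rows
print(s.loc['2023-02'])  # the eight February rows
```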
s.loc['2023-01-10':'2023-01-12'] # inclusive slicing
2023-01-10 00:00:00 24
2023-01-10 05:00:00 25
2023-01-10 10:00:00 26
2023-01-10 15:00:00 27
2023-01-10 20:00:00 28
2023-01-11 01:00:00 29
2023-01-11 06:00:00 30
2023-01-11 11:00:00 31
2023-01-11 16:00:00 32
2023-01-11 21:00:00 33
2023-01-12 02:00:00 34
2023-01-12 07:00:00 35
2023-01-12 12:00:00 36
2023-01-12 17:00:00 37
2023-01-12 22:00:00 38
Freq: 5h, Name: signal 1, dtype: int64
This syntax even extends to slicing our data, so we can grab all of the rows between two dates as well!
However, sometimes we don’t want to select in this hierarchical/contiguous fashion.
What if we wanted to select the rows of our data from 9am-5pm from EVERY day?
We’ll need to apply a little logic here to construct a boolean mask. But before we do that, let’s discuss how we can access the hour component of our date times.
For a DatetimeIndex object, we can access various components of our values like so:
s.index # access the index (which is a `DatetimeIndex`)
s.index.hour
Index([ 0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13,
18, 23, 4, 9, 14, 19, 0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7,
12, 17, 22, 3, 8, 13, 18, 23, 4, 9, 14, 19, 0, 5, 10, 15, 20, 1,
6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13, 18, 23, 4, 9, 14, 19,
0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13,
18, 23, 4, 9, 14, 19, 0, 5, 10, 15],
dtype='int32')
If our date time data is in a Series object, we’ll need to use the .dt accessor:
print(Series(s.index).dt.hour)
0 0
1 5
2 10
3 15
4 20
..
95 19
96 0
97 5
98 10
99 15
Length: 100, dtype: int32
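The hour is just one of many components exposed this way; attributes like day_name, dayofweek, and quarter are available on both DatetimeIndex and the .dt accessor. A small sketch with a few made-up dates:

```python
from pandas import Series, to_datetime

dates = Series(to_datetime(['2023-01-05', '2023-01-06', '2023-01-07']))

# a few of the other components available through the .dt accessor
print(dates.dt.day_name())   # Thursday, Friday, Saturday
print(dates.dt.dayofweek)    # 3, 4, 5 (Monday == 0)
print(dates.dt.quarter)      # 1, 1, 1
```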
From here we can use just a bit of logic and construct our mask:
s.loc[(9 <= s.index.hour) & (s.index.hour <= 17)].head()
2023-01-05 10:00:00 2
2023-01-05 15:00:00 3
2023-01-06 11:00:00 7
2023-01-06 16:00:00 8
2023-01-07 12:00:00 12
Name: signal 1, dtype: int64
Of course, there is a convenience method for nearly everything in pandas, so we can also use the .between_time method:
s.between_time('09:00', '17:00').head() # pandas parses a time string for us!
2023-01-05 10:00:00 2
2023-01-05 15:00:00 3
2023-01-06 11:00:00 7
2023-01-06 16:00:00 8
2023-01-07 12:00:00 12
Name: signal 1, dtype: int64
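In the same family of conveniences, the .at_time method selects the rows that land on exactly one clock time each day. A quick sketch, rebuilding the same series as above:

```python
from pandas import Series, date_range
from numpy import arange

s = Series(
    index=date_range('2023-01-05', periods=100, freq='5h'),
    data=arange(100),
    name='signal 1',
)

# .at_time keeps only the rows whose timestamp is exactly 10:00
print(s.at_time('10:00'))
```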
Working With Multiple Time Series
When working with multiple sets of time series data, we need to perform some type of alignment before we can analyze them together.
from IPython.display import display, Markdown
from pandas import Series, date_range, merge_asof, Timedelta
from numpy import arange
s1 = Series(
index=date_range('2023-01-05', periods=10, freq='3D'),
data=arange(10),
name='signal 1',
)
s2 = Series(
index=date_range('2023-01-05', periods=10, freq='2D'),
data=arange(10) + 10,
name='signal 2'
)
display(
Markdown("3 Day Freq"),
s1,
Markdown("2 Day Freq"),
s2
)
3 Day Freq
2023-01-05 0
2023-01-08 1
2023-01-11 2
2023-01-14 3
2023-01-17 4
2023-01-20 5
2023-01-23 6
2023-01-26 7
2023-01-29 8
2023-02-01 9
Freq: 3D, Name: signal 1, dtype: int64
2 Day Freq
2023-01-05 10
2023-01-07 11
2023-01-09 12
2023-01-11 13
2023-01-13 14
2023-01-15 15
2023-01-17 16
2023-01-19 17
2023-01-21 18
2023-01-23 19
Freq: 2D, Name: signal 2, dtype: int64
If we wanted to combine these two datasets, say, by adding their values together, we cannot do so naively without risking erroneous results. Instead, we need to respect their date component, align based on that, and then perform our addition.
Thankfully this is the entire purpose of the index!
s1 + s2
2023-01-05 10.0
2023-01-07 NaN
2023-01-08 NaN
2023-01-09 NaN
2023-01-11 15.0
2023-01-13 NaN
2023-01-14 NaN
2023-01-15 NaN
2023-01-17 20.0
2023-01-19 NaN
2023-01-20 NaN
2023-01-21 NaN
2023-01-23 25.0
2023-01-26 NaN
2023-01-29 NaN
2023-02-01 NaN
dtype: float64
We can see that the above code relied on index alignment to perform the addition, but we have NaNs wherever the alignment failed. We can manually fix this issue by constructing our own index that is a union of the current indices and filling in the missing values with 0.
This trick of manually constructing an index and aligning data to it is what pandas
does under-the-hood in most cases. However, I find that it is incredibly useful
to exert some manual control over these operations so that I can work more effectively
with index alignment instead of against it.
new_index = s1.index.union(s2.index).sort_values()
s1.reindex(new_index, fill_value=0) + s2.reindex(new_index, fill_value=0)
2023-01-05 10
2023-01-07 11
2023-01-08 1
2023-01-09 12
2023-01-11 15
2023-01-13 14
2023-01-14 3
2023-01-15 15
2023-01-17 20
2023-01-19 17
2023-01-20 5
2023-01-21 18
2023-01-23 25
2023-01-26 7
2023-01-29 8
2023-02-01 9
dtype: int64
Alternatively, we can rely on some of the keyword argument conveniences of .add to achieve the same result:
s1.add(s2, fill_value=0)
2023-01-05 10.0
2023-01-07 11.0
2023-01-08 1.0
2023-01-09 12.0
2023-01-11 15.0
2023-01-13 14.0
2023-01-14 3.0
2023-01-15 15.0
2023-01-17 20.0
2023-01-19 17.0
2023-01-20 5.0
2023-01-21 18.0
2023-01-23 25.0
2023-01-26 7.0
2023-01-29 8.0
2023-02-01 9.0
dtype: float64
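Exact-date alignment is not the only option. When two series rarely share timestamps, merge_asof can instead pair each observation with the most recent earlier one, optionally within a tolerance window. A sketch using the same two series (note this matches nearest-past values rather than adding aligned dates):

```python
from pandas import Series, date_range, merge_asof, Timedelta
from numpy import arange

s1 = Series(
    index=date_range('2023-01-05', periods=10, freq='3D'),
    data=arange(10),
    name='signal 1',
)
s2 = Series(
    index=date_range('2023-01-05', periods=10, freq='2D'),
    data=arange(10) + 10,
    name='signal 2',
)

# pair each row of s1 with the most recent s2 value at or before its date,
# but only if that value is no more than 2 days old
aligned = merge_asof(
    s1.reset_index(), s2.reset_index(),
    on='index', tolerance=Timedelta(days=2),
)
print(aligned)
```

Rows of s1 that fall after s2 ends (or more than 2 days past its last observation) come back as NaN, which makes stale matches easy to spot.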
Wrap Up
That’s all the time for today! pandas has much more power when dealing with time series data, which I’ll dive into in a future post. There is one final important distinction one needs to make when working with date times in pandas, and a whole nest of undiscussed capabilities. Check out next week’s blog entry, where I will discuss this distinction in the final chapter of date times in pandas!