Dealing With Dates in Pandas - Part 2#

In my previous post, we discussed how to approach date times in pandas, the metaphors the library uses, and the difference between absolute time and calendar time (also referred to as relative time).

This week, we’ll dive a little bit deeper into the functionality that pandas has to offer when dealing with time series data, covering topics like:

  • parsing date times

  • date time-aware selections

  • aligning date time data from different frequencies

Parsing date times#

Text-based file formats are one of the most common forms of data storage. Unfortunately, this means our dates are stored as text, away from their numeric-based representation. In order to work with our dates effectively, we need to parse them back into their appropriate pandas representation.

from pandas import read_fwf
from io import StringIO

buf = StringIO('''
date         data
2023-jan-01  58
2023-mar-25  59
2023-nov-18  60
'''.strip('\n'))

df = read_fwf(buf)
df.dtypes # note the object dtype, which hints that this is a column of strings
date    object
data     int64
dtype: object
from pandas import to_datetime

print(
    to_datetime(df['date']), # pandas is very good at guessing datetime format
    to_datetime(df['date'], format='%Y-%b-%d'), # though we can pass the format ourselves
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40)
)
0   2023-01-01
1   2023-03-25
2   2023-11-18
Name: date, dtype: datetime64[ns]
────────────────────────────────────────
0   2023-01-01
1   2023-03-25
2   2023-11-18
Name: date, dtype: datetime64[ns]
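
By default, to_datetime raises an error if it encounters a value it cannot parse. When working with messy real-world text, we can instead pass errors='coerce' to turn unparseable entries into NaT (pandas' "not a time" marker). A small sketch with a deliberately malformed value (not from the data above):

```python
from pandas import Series, to_datetime, isna

raw = Series(['2023-jan-01', 'not a date', '2023-nov-18'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = to_datetime(raw, format='%Y-%b-%d', errors='coerce')
print(parsed)
```

This lets us parse an entire column in one pass and deal with the bad rows afterwards, e.g. by filtering on .isna().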

Date time-aware selections#

The best part about working with pandas is the power of the Index. So it should come as no surprise that working with a Series or DataFrame that has a DatetimeIndex is also a convenient experience (if you use the .index appropriately).

from pandas import Series, date_range
from numpy import arange

s = Series(
    index=date_range('2023-01-05', periods=100, freq='5H'),
    data=arange(100),
    name='signal 1',
)

# .loc implicitly converts strings to date times and selects data accordingly
s.loc['2023-01-10']
2023-01-10 00:00:00    24
2023-01-10 05:00:00    25
2023-01-10 10:00:00    26
2023-01-10 15:00:00    27
2023-01-10 20:00:00    28
Freq: 5H, Name: signal 1, dtype: int64

By passing the string ‘2023-01-10’, we select every timestamp that falls on that date. So, by being less specific, we can select broad swaths of dates.

s.loc['2023-01-10':'2023-01-12'] # inclusive slicing
2023-01-10 00:00:00    24
2023-01-10 05:00:00    25
2023-01-10 10:00:00    26
2023-01-10 15:00:00    27
2023-01-10 20:00:00    28
2023-01-11 01:00:00    29
2023-01-11 06:00:00    30
2023-01-11 11:00:00    31
2023-01-11 16:00:00    32
2023-01-11 21:00:00    33
2023-01-12 02:00:00    34
2023-01-12 07:00:00    35
2023-01-12 12:00:00    36
2023-01-12 17:00:00    37
2023-01-12 22:00:00    38
Freq: 5H, Name: signal 1, dtype: int64

This syntax even extends to slicing our data, so we can grab all of the rows between two dates as well!
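
This partial-string indexing works at any granularity: a year-month string selects every row in that month. A quick sketch with a small illustrative Series (not the one above) that spans two months:

```python
from pandas import Series, date_range
from numpy import arange

# an illustrative Series spanning late January into February (2-day frequency)
s = Series(
    index=date_range('2023-01-25', periods=10, freq='2D'),
    data=arange(10),
)

print(s.loc['2023-01'])  # just the January rows
print(s.loc['2023-02'])  # just the February rows
```

The same string style also works on slice endpoints, so s.loc['2023-01':'2023-02'] would grab both months at once.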

However, sometimes we don’t want to select in this hierarchical/contiguous fashion. What if we wanted to select the rows of our data from 9am-5pm from EVERY day? We’ll need to apply a little logic here to construct a boolean mask. But before we do that, let’s discuss how we can access the hour component of our date times.

For a DatetimeIndex object, we can access various components of our values like so:

s.index # access the index (which is a `DatetimeIndex`)
s.index.hour
Int64Index([ 0,  5, 10, 15, 20,  1,  6, 11, 16, 21,  2,  7, 12, 17, 22,  3,  8,
            13, 18, 23,  4,  9, 14, 19,  0,  5, 10, 15, 20,  1,  6, 11, 16, 21,
             2,  7, 12, 17, 22,  3,  8, 13, 18, 23,  4,  9, 14, 19,  0,  5, 10,
            15, 20,  1,  6, 11, 16, 21,  2,  7, 12, 17, 22,  3,  8, 13, 18, 23,
             4,  9, 14, 19,  0,  5, 10, 15, 20,  1,  6, 11, 16, 21,  2,  7, 12,
            17, 22,  3,  8, 13, 18, 23,  4,  9, 14, 19,  0,  5, 10, 15],
           dtype='int64')
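
.hour is just one of many components exposed on a DatetimeIndex; attributes like .month and .dayofweek, and methods like .day_name(), work the same way. A small sketch on a fresh index:

```python
from pandas import date_range

idx = date_range('2023-01-05', periods=4, freq='5H')

print(idx.month)       # month number of each timestamp
print(idx.dayofweek)   # day of week as an integer, Monday=0 .. Sunday=6
print(idx.day_name())  # spelled-out weekday names
```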

If our date time data is in a Series object, we’ll need to use the .dt accessor:

print(Series(s.index).dt.hour)
0      0
1      5
2     10
3     15
4     20
      ..
95    19
96     0
97     5
98    10
99    15
Length: 100, dtype: int64

From here we can use just a bit of logic and construct our mask:

s.loc[(9 <= s.index.hour) & (s.index.hour <= 17)].head()
2023-01-05 10:00:00     2
2023-01-05 15:00:00     3
2023-01-06 11:00:00     7
2023-01-06 16:00:00     8
2023-01-07 12:00:00    12
Name: signal 1, dtype: int64

Of course, there is a convenience method for nearly everything in pandas, so we can also use the .between_time method:

s.between_time('09:00', '17:00').head() # pandas parses a time string for us!
2023-01-05 10:00:00     2
2023-01-05 15:00:00     3
2023-01-06 11:00:00     7
2023-01-06 16:00:00     8
2023-01-07 12:00:00    12
Name: signal 1, dtype: int64
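
A close sibling of .between_time is .at_time, which keeps only the rows whose clock time matches exactly. A sketch, rebuilding the same 5-hour Series from above:

```python
from pandas import Series, date_range
from numpy import arange

# same Series as above: 100 points at a 5-hour frequency
s = Series(
    index=date_range('2023-01-05', periods=100, freq='5H'),
    data=arange(100),
    name='signal 1',
)

# .at_time keeps only rows whose time-of-day is exactly 10:00
print(s.at_time('10:00'))
```

Since the 5-hour frequency only lands on 10:00 once every 5 days, this returns a much sparser selection than .between_time does.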

Working With Multiple Time Series#

When working with multiple sets of time series data, we need to perform some type of alignment before we can analyze them together.

from IPython.display import display, Markdown
from pandas import Series, date_range, merge_asof, Timedelta
from numpy import arange

s1 = Series(
    index=date_range('2023-01-05', periods=10, freq='3D'),
    data=arange(10),
    name='signal 1',
)

s2 = Series(
    index=date_range('2023-01-05', periods=10, freq='2D'),
    data=arange(10) + 10,
    name='signal 2'
)

display(
    Markdown("3 Day Freq"),
    s1, 
    Markdown("2 Day Freq"),
    s2
)

3 Day Freq

2023-01-05    0
2023-01-08    1
2023-01-11    2
2023-01-14    3
2023-01-17    4
2023-01-20    5
2023-01-23    6
2023-01-26    7
2023-01-29    8
2023-02-01    9
Freq: 3D, Name: signal 1, dtype: int64

2 Day Freq

2023-01-05    10
2023-01-07    11
2023-01-09    12
2023-01-11    13
2023-01-13    14
2023-01-15    15
2023-01-17    16
2023-01-19    17
2023-01-21    18
2023-01-23    19
Freq: 2D, Name: signal 2, dtype: int64

If we wanted to combine these two datasets (say, add their values together), we cannot do so naively unless we want erroneous results. Instead, we need to respect their date component, align based on that, and then perform our addition.

Thankfully this is the entire purpose of the index!

s1 + s2
2023-01-05    10.0
2023-01-07     NaN
2023-01-08     NaN
2023-01-09     NaN
2023-01-11    15.0
2023-01-13     NaN
2023-01-14     NaN
2023-01-15     NaN
2023-01-17    20.0
2023-01-19     NaN
2023-01-20     NaN
2023-01-21     NaN
2023-01-23    25.0
2023-01-26     NaN
2023-01-29     NaN
2023-02-01     NaN
dtype: float64

We can see that the above code relied on index alignment to perform the addition, but we have NaNs wherever the alignment failed. We can manually fix this issue by constructing our own index that is a union of the current indices and filling in the missing values with 0.

This trick of manually constructing an index and aligning data to it is what pandas does under-the-hood in most cases. However, I find that it is incredibly useful to exert some manual control over these operations so that I can work more effectively with index alignment instead of against it.

new_index = s1.index.union(s2.index).sort_values()
s1.reindex(new_index, fill_value=0) + s2.reindex(new_index, fill_value=0)
2023-01-05    10
2023-01-07    11
2023-01-08     1
2023-01-09    12
2023-01-11    15
2023-01-13    14
2023-01-14     3
2023-01-15    15
2023-01-17    20
2023-01-19    17
2023-01-20     5
2023-01-21    18
2023-01-23    25
2023-01-26     7
2023-01-29     8
2023-02-01     9
dtype: int64

Alternatively, we can rely on some of the keyword argument conveniences of .add to achieve the same result:

s1.add(s2, fill_value=0)
2023-01-05    10.0
2023-01-07    11.0
2023-01-08     1.0
2023-01-09    12.0
2023-01-11    15.0
2023-01-13    14.0
2023-01-14     3.0
2023-01-15    15.0
2023-01-17    20.0
2023-01-19    17.0
2023-01-20     5.0
2023-01-21    18.0
2023-01-23    25.0
2023-01-26     7.0
2023-01-29     8.0
2023-02-01     9.0
dtype: float64
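
Exact index alignment only helps when timestamps line up precisely. The imports at the top of this section also pull in merge_asof and Timedelta, which handle the fuzzier case: matching each row to the most recent prior observation from the other series, optionally within a tolerance. A sketch using the same two signals (the exact matching rules here are my choice, not from the post):

```python
from pandas import Series, date_range, merge_asof, Timedelta
from numpy import arange

s1 = Series(
    index=date_range('2023-01-05', periods=10, freq='3D'),
    data=arange(10),
    name='signal 1',
)
s2 = Series(
    index=date_range('2023-01-05', periods=10, freq='2D'),
    data=arange(10) + 10,
    name='signal 2',
)

# match each signal-1 timestamp to the most recent signal-2 observation,
# looking backward at most 1 day; unmatched rows get NaN
aligned = merge_asof(
    s1.reset_index(),
    s2.reset_index(),
    on='index',
    direction='backward',
    tolerance=Timedelta('1D'),
)
print(aligned)
```

This "as-of" join is a staple for combining sensor or market data sampled at different rates, where a strict equality join would discard nearly everything.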

Wrap Up#

That’s all the time we have for today! pandas has much more power for time series data than we covered here. There is one final important distinction to make when working with date times in pandas, along with a whole nest of yet-undiscussed capabilities. Check out next week’s blog entry, where I will discuss this distinction in the final chapter of date times in pandas!