Good pandas means good Python#

Welcome back to Cameron’s Corner! This week, I want to talk about the intersection of Python and pandas. I often hear from other teachers that it is easiest to teach skills that will help students get “up and running.” Unfortunately, this often translates to “let’s teach the pandas API.” This leads to many roadblocks down the line caused by an extremely superficial understanding of how to think about pandas operations or how to best leverage Python to lean into your pandas tasks.

So, let’s take a look at a data-cleaning example, where, while possible, working through pandas will be clumsy.

Cleaning Textual Data#

We have the below data stored in a .csv file, and, while the data are technically “comma separated,” they are not tabular. As such, we will encounter some issues ingesting this data.

from pandas import read_csv, DataFrame
from io import StringIO
from textwrap import dedent

buffer = StringIO(dedent('''


Note: the following snippet is included to shorten the lengthy tracebacks pandas generates. The focus here will be on the general nature of the error rather than any particular line from its traceback.

from traceback import print_exc
from contextlib import contextmanager

def short_traceback(limit=1):
    except Exception:

Data Ingestion#

Naive pandas

Let’s do this the straightforward way by asking the question, “can pandas parse it?” You can probably guess it won’t. If your next thought is to “look around the pandas documentation to find a function that will parse the data,” then you’re out of luck. pandas makes strong assumptions about the shape of your data after inspecting the first few lines of your .csv file.
with short_traceback(1):
Traceback (most recent call last):
  File "/tmp/ipykernel_4014583/", line 7, in short_traceback
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 4, saw 5

Well, maybe I just need to ‘skip’ bad lines or supply some other argument to properly read the data into a DataFrame.
read_csv(buffer, on_bad_lines='skip')
device upgrade_dates
device-1 2000-01-01 2000-02-01 2000-03-01
device-2 2000-01-01 2000-04-01 NaN
device-4 2000-01-01 2000-07-01 2000-09-01

While we did read in the data, it is not in a usable form. Our index has been combined, we lost a row, and the columns do not align with the correct values.

Just a bit of Python

The Python-based solution to this problem is quite straightforward. All it takes is a little knowledge about file-objects, generators, and string operations. We could take this a step further with the csv module to ensure correct escaping of quoted field separators, but we won’t need to for this example.

def process(f):
    f = (ln.strip() for ln in f)
    yield next(f).split(',')
    for line in f:
        dev, *dates = line.split(',')
        yield dev, dates
head, *body = process(buffer)
df = DataFrame(body, columns=head)
device upgrade_dates
0 device-1 [2000-01-01, 2000-02-01, 2000-03-01]
1 device-2 [2000-01-01, 2000-04-01]
2 device-3 [2000-01-01, 2000-03-01, 2000-05-01, 2000-10-01]
3 device-4 [2000-01-01, 2000-07-01, 2000-09-01]

While the data has been parsed correctly, we are not working with optimal pandas datatypes.

device           object
upgrade_dates    object
dtype: object

Note that both of our columns are ‘object’ dtyped. The 'device' column can be easily swapped out for a ‘string’ dtype, but the other column cannot be so easily converted. Instead, we need to reshape our data using .explode.

device upgrade_dates
0 device-1 2000-01-01
0 device-1 2000-02-01
0 device-1 2000-03-01
1 device-2 2000-01-01
1 device-2 2000-04-01
2 device-3 2000-01-01
2 device-3 2000-03-01
2 device-3 2000-05-01
2 device-3 2000-10-01
3 device-4 2000-01-01
3 device-4 2000-07-01
3 device-4 2000-09-01

Now our data is in a format that is ready to analyze! However, with just a little more Python, we could have performed this transformation when we ingested the data:

def process(f):
    f = (ln.strip() for ln in f)
    yield next(f).split(',')
    for line in f:
        dev, *dates = line.split(',')
        for d in dates: # explode as pre-processing step
            yield dev, d
head, *body = process(buffer)
DataFrame(body, columns=head)
device upgrade_dates
0 device-1 2000-01-01
1 device-1 2000-02-01
2 device-1 2000-03-01
3 device-2 2000-01-01
4 device-2 2000-04-01
5 device-3 2000-01-01
6 device-3 2000-03-01
7 device-3 2000-05-01
8 device-3 2000-10-01
9 device-4 2000-01-01
10 device-4 2000-07-01
11 device-4 2000-09-01


And there you have it! Just a tiny bit of Python can be used to clean-up your pandas code, and learning when and how to leverage them is essential to get the job done.

What do you think? Let us know on the DUTC Discord server.

Talk to you all again next week!