Working With Slightly Messy Data#

Hello, everyone! This week, I want to discuss working with real-world datasets. Specifically, how it’s common (and even expected) to encounter a number of data quality issues.

Some common questions you want to ask yourself when working with a new dataset are…

Can the values be worked with in their existing format?
Are there null values and, if so, why might there be null values?
Is this dataset from a primary or secondary source? How much preprocessing has already been done to it that was not in my control?

These questions are very important to consider when working with data as they will guide you on how you should clean your data. Let’s walk through an example of some data pulled from the internet.

If this is interesting for you, then you’re going to love our next seminar, “Spot the Lies Told with this Data,” which I’ll be hosting on March 11th!

Grabbing Some Data#

The dataset we’ll be working with is a weather report from Newark Liberty International Airport. There are near-hourly recordings of the weather for a given day (In this case, February 22, 1980.) The dataset is quite small but will serve as a good example for working with some slightly messy data.

The data is served by a third-party API to wunderground.com/history. (This means that the tables are populated by javascript API calls after the page has loaded.) I was feeling a little too lazy to investigate the API calls the site was making, so I decided to leverage a “headless” selenium session to allow the data to be loaded via javascript before scraping the table out of the html.

Want to check out the url in question? Here’s the link: https://www.wunderground.com/history/daily/KEWR/date/1980-2-22

from pandas import read_html

from io import StringIO
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
driver.get("https://www.wunderground.com/history/daily/KEWR/date/1980-2-22")
WebDriverWait(driver, timeout=10).until(
    lambda d: d.find_element(By.TAG_NAME, "table")
)

temperatures_df = read_html(StringIO(driver.page_source))[1]
temperatures_df.to_pickle('data/KEWR-weather-1980-02-22.pickle')

from pandas import read_pickle

temperatures_df = read_pickle('data/KEWR-weather-1980-02-22.pickle')
temperatures_df.head()

	Time	Temperature	Dew Point	Humidity	Wind	Wind Speed	Pressure	Condition
0	1:00 AM	40 °F	0 °F	73 °%	NNE	13 °mph	30.03 °in	Partly Cloudy
1	2:00 AM	38 °F	31 °F	76 °%	N	7 °mph	30.09 °in	Cloudy
2	3:00 AM	39 °F	31 °F	73 °%	NE	5 °mph	30.08 °in	Cloudy
3	4:00 AM	38 °F	31 °F	76 °%	NNE	7 °mph	30.06 °in	Cloudy
4	5:00 AM	40 °F	31 °F	70 °%	NE	9 °mph	30.06 °in	Cloudy

Inspecting Your Data#

As the saying goes, “A picture is worth a thousand words”, and in the case of data visualization, this is especially true. With a small number of features, we can effectively communicate inferences about the data with simple tables. This is because there isn’t much information to wade through and not many branching paths one can take when investigating summaries of smaller datasets.

from IPython.display import display, Markdown

display(
    Markdown(f'**Number of Rows in data {len(temperatures_df)}**'),
    Markdown('**Describe:**'),
    temperatures_df.describe(include='all'),
    Markdown('**Dtypes**'),
    temperatures_df.dtypes,
    Markdown('**5 row sample**'),
    temperatures_df.sample(5, random_state=0)
)

Number of Rows in data 44

Describe:

	Time	Temperature	Dew Point	Humidity	Wind	Wind Speed	Wind Gust	Pressure	Precip.	Condition
count	34	34	34	34	33	34	34	34	34	34
unique	34	7	5	10	6	9	1	17	3	3
top	1:00 AM	0 °F	0 °F	0 °%	NE	10 °mph	0 °mph	30.10 °in	0.0 °in	Cloudy
freq	1	11	16	12	10	8	34	7	32	32

Dtypes

Time           object
Temperature    object
Dew Point      object
Humidity       object
Wind           object
Wind Speed     object
Wind Gust      object
Pressure       object
Precip.        object
Condition      object
dtype: object

5 row sample

	Time	Temperature	Dew Point	Humidity	Wind	Wind Speed	Wind Gust	Pressure	Precip.	Condition
30	10:00 PM	36 °F	33 °F	89 °%	N	12 °mph	0 °mph	29.78 °in	0.0 °in	Cloudy
37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27	8:16 PM	0 °F	0 °F	0 °%	ENE	14 °mph	0 °mph	29.76 °in	0.0 °in	Mostly Cloudy
4	5:00 AM	40 °F	31 °F	70 °%	NE	9 °mph	0 °mph	30.06 °in	0.0 °in	Cloudy
10	8:00 AM	34 °F	0 °F	92 °%	E	10 °mph	0 °mph	30.11 °in	0.3 °in	Cloudy

Before we discuss NaN values, we have a parsing issue: many columns have their units as part of their data. To work effectively here, we should instead move the units into the column names so we are working with truly numeric data.

Parsing Values from Units#

An important aspect of working in a notebook format is to ensure that each cell can be run independently of the cells below it. By enforcing this linear design, we want to ensure that…

Each cell can be run multiple times without needing to refresh upstream cells
Changes downstream should not change upstream cells. This means you should try to avoid reusing the same variable names.

from pandas import DataFrame, to_numeric, to_datetime

units = {
    'Temperature': '°F', 
    'Dew Point': '°F',
    'Humidity': '°%', 
    'Wind Speed': '°mph',
    'Wind Gust': '°mph',
    'Pressure': '°in', 
    'Precip.': '°in'
}

parsed_data = {}
for colname, unit in units.items():
    parsed_col = (
        temperatures_df[colname].str.extract(fr'(\d+)\s*{unit}', expand=False)
    )
    parsed_data[f'{colname} ({unit})'] = to_numeric(parsed_col)


parsed_df = (
    DataFrame(parsed_data).join(temperatures_df.drop(columns=units.keys()))
    .assign(
        Time=lambda d: to_datetime('1980-02-22'+d['Time'], format='%Y-%m-%d%I:%M %p')
   )
)

display(
    Markdown('**Describe:**'),
    parsed_df.describe(include='all'),
    Markdown('**Dtypes**'),
    parsed_df.dtypes,
    Markdown('**5 row sample**'),
    parsed_df.sample(5, random_state=0)
)

Describe:

	Temperature (°F)	Dew Point (°F)	Humidity (°%)	Wind Speed (°mph)	Wind Gust (°mph)	Pressure (°in)	Precip. (°in)	Time	Wind	Condition
count	34.000000	34.000000	34.000000	34.000000	34.0	34.000000	34.000000	34	33	34
unique	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6	3
top	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NE	Cloudy
freq	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	10	32
mean	24.558824	16.823529	54.382353	10.382353	0.0	38.882353	0.117647	1980-02-22 11:40:30	NaN	NaN
min	0.000000	0.000000	0.000000	0.000000	0.0	3.000000	0.000000	1980-02-22 00:00:00	NaN	NaN
25%	0.000000	0.000000	0.000000	9.000000	0.0	9.000000	0.000000	1980-02-22 07:24:15	NaN	NaN
50%	35.000000	30.000000	74.500000	10.000000	0.0	10.000000	0.000000	1980-02-22 09:48:30	NaN	NaN
75%	36.000000	31.000000	89.000000	12.750000	0.0	79.000000	0.000000	1980-02-22 17:45:00	NaN	NaN
max	40.000000	34.000000	93.000000	14.000000	0.0	96.000000	3.000000	1980-02-22 23:00:00	NaN	NaN
std	17.322592	16.127279	41.272938	2.902608	0.0	37.529596	0.537373	NaN	NaN	NaN

Dtypes

Temperature (°F)            float64
Dew Point (°F)              float64
Humidity (°%)               float64
Wind Speed (°mph)           float64
Wind Gust (°mph)            float64
Pressure (°in)              float64
Precip. (°in)               float64
Time                 datetime64[ns]
Wind                         object
Condition                    object
dtype: object

5 row sample

	Temperature (°F)	Dew Point (°F)	Humidity (°%)	Wind Speed (°mph)	Wind Gust (°mph)	Pressure (°in)	Precip. (°in)	Time	Wind	Condition
30	36.0	33.0	89.0	12.0	0.0	78.0	0.0	1980-02-22 22:00:00	N	Cloudy
37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT	NaN	NaN
27	0.0	0.0	0.0	14.0	0.0	76.0	0.0	1980-02-22 20:16:00	ENE	Mostly Cloudy
4	40.0	31.0	70.0	9.0	0.0	6.0	0.0	1980-02-22 05:00:00	NE	Cloudy
10	34.0	0.0	92.0	10.0	0.0	11.0	3.0	1980-02-22 08:00:00	E	Cloudy

Null Value Exploration#

With the datatype parsing (mostly) complete (Wind and Condition could be categorical, but I’ll skip that for now), we can move onto our NaNs.

Clearly, we have a NaN issue. This is fairly common when scraping data as the input is not always ingested reliably, or the HTML is not always formatted correctly. In this case, the table element was built dynamically with JavaScript, so we should expect some data cleanliness issues.

But before we go charging ahead with nan-filling, dropping, or interpolation, let’s examine where the NaN values are in our data. This can be done through a number of summary metrics and some neat visualizations.

display(
    Markdown('**Number of NaN per feature**'),
    (
        parsed_df.agg(['count', 'size']).T
        .assign(delta_nulls=lambda d: d['size'] - d['count'])
    )
)

Number of NaN per feature

	count	size	delta_nulls
Temperature (°F)	34	44	10
Dew Point (°F)	34	44	10
Humidity (°%)	34	44	10
Wind Speed (°mph)	34	44	10
Wind Gust (°mph)	34	44	10
Pressure (°in)	34	44	10
Precip. (°in)	34	44	10
Time	34	44	10
Wind	33	44	11
Condition	34	44	10

Visualizing Null Values#

We can see that there is a static number of NaN values per feature (aside from “Wind” where there is 1 extra null value!). While it’s important to know this as a summary statistic, it is also important to know where these values occur in our data. Are they all in the same row? Do they exhibit any relationship?

A simple exploratory visualization can help answer these questions for us:

from matplotlib.pyplot import subplots, setp
from matplotlib.ticker import MultipleLocator

def plot_nulls(df, ax):   
    ax.pcolormesh(df.columns, df.index, df.isna(), cmap='Greys')

    # Add vertical gridlines to distinguish columns
    ax.xaxis.grid(which='minor')
    ax.xaxis.set_minor_locator(MultipleLocator(0.5))
    ax.xaxis.set_tick_params(which='both', length=0, width=0)

    # invert heatmap & move the xticks to the top to
    #   tighten relationship to DataFrame
    ax.xaxis.tick_top()
    ax.invert_yaxis()

    setp(
        ax.get_xticklabels(), rotation=-30, ha='right',
        va='center', rotation_mode='anchor'
    )
    return ax

fig, ax = subplots()
ax = plot_nulls(parsed_df, ax=ax)

../_images/ea54605122cd10cb05da67af97d37a8fd786d49ee7cc55da1c433250048ea420.png

This is a fairly standard way to investigate where null values occur in your DataFrame. We’ve reduced each value in our DataFrame to a color: black or white. In this case, whereever you see a black tile is where we have a null value in our data. We can add features like spark lines and total counts in the margins, which I’d personally add in manually (I enjoy the control matplotlib provides), but we can also reach for the missingno package.

Regardless of your route of ending up for this null-matrix, this visualization draws our attention to the large block of null values across all columns in the bottom ~10 rows of our data. Since these nulls span acorss every single columns, it may be safe to assum that this was due to a data parsing issue, so we are probably safe to drop these values rows entirely.

from matplotlib.pyplot import subplot_mosaic
from matplotlib.ticker import IndexLocator, MultipleLocator

parsed_df = parsed_df.dropna(how='all')

# Let's spruce up the visualization even more too!
mosaic = [
    ['matrix', 'spark'],
    ['totals', '.'     ],
]

fig, axd = subplot_mosaic(
    mosaic, dpi=120, figsize=(8, 6),
    gridspec_kw={
        'width_ratios': [4, 1], 'height_ratios': [3, 1], 'top': .8,
        'hspace': .1, 'wspace': .1
    },
)

plot_nulls(parsed_df, ax=axd['matrix'])
axd['matrix'].spines['bottom'].set_visible(False)

axd['spark'].step(parsed_df.isna().sum(axis='columns'), parsed_df.index, where='post')
axd['spark'].sharey(axd['matrix'])
axd['spark'].set_title('Row-wise null count', ha='left', loc='left')
axd['spark'].yaxis.set_label_position('right')
axd['spark'].yaxis.set_visible(False)
axd['spark'].spines['left'].set_visible(False)

axd['totals'].bar(parsed_df.columns, parsed_df.isna().sum().div(len(parsed_df)))
axd['totals'].set_ylim(bottom=0, top=1)
axd['totals'].sharex(axd['matrix'])
axd['totals'].xaxis.set_tick_params('both', length=0, width=0)
axd['totals'].yaxis.set_major_locator(MultipleLocator(.5))
axd['totals'].yaxis.set_major_formatter(lambda x, pos: f'{100*x:g}%')
axd['totals'].set_xlabel('Column-wise Null Fraction')
setp(axd['totals'].get_xticklabels(), visible=False);

fig.suptitle('Exploring Null Values in our Data');

../_images/94c31e90931abf36cb538f0bce6cd4175f0cdcd432c9b6f2e4c6b9010e848ca9.png

Since the above layout was a little boring, let’s see how this visualization would look if we had many more null data points in our dataset!

from numpy.random import default_rng
from matplotlib.pyplot import subplot_mosaic
from matplotlib.ticker import IndexLocator, MultipleLocator

rng = default_rng(0)
mask = rng.choice([0, 1], p=[.7, .3], size=parsed_df.shape).astype(bool)
more_nulls = parsed_df.mask(mask)

from matplotlib.pyplot import subplot_mosaic
from matplotlib.ticker import IndexLocator, MultipleLocator

parsed_df = parsed_df.dropna(how='all')

# Let's spruce up the visualization even more too!
mosaic = [
    ['matrix', 'spark'],
    ['totals', '.'     ],
]

fig, axd = subplot_mosaic(
    mosaic, dpi=120, figsize=(8, 6),
    gridspec_kw={
        'width_ratios': [4, 1], 'height_ratios': [3, 1], 'top': .8,
        'hspace': .1, 'wspace': .1
    },
)

plot_nulls(more_nulls, ax=axd['matrix'])
axd['matrix'].spines['bottom'].set_visible(False)

axd['spark'].step(more_nulls.isna().sum(axis='columns'), parsed_df.index, where='post')
axd['spark'].sharey(axd['matrix'])
axd['spark'].set_title('Row-wise null count', ha='left', loc='left')
axd['spark'].yaxis.set_label_position('right')
axd['spark'].yaxis.set_visible(False)
axd['spark'].spines['left'].set_visible(False)
axd['spark'].xaxis.set_major_locator(MultipleLocator(2))

axd['totals'].bar(more_nulls.columns, more_nulls.isna().sum().div(len(more_nulls)))
axd['totals'].set_ylim(bottom=0, top=1)
axd['totals'].sharex(axd['matrix'])
axd['totals'].xaxis.set_tick_params('both', length=0, width=0)
axd['totals'].yaxis.set_major_locator(MultipleLocator(.5))
axd['totals'].yaxis.set_major_formatter(lambda x, pos: f'{100*x:g}%')
axd['totals'].set_xlabel('Column-wise Null Fraction')
setp(axd['totals'].get_xticklabels(), visible=False);

fig.suptitle('Exploring Null Values in our Data');

../_images/aed4c58173ddd372faa2d1e9e6a5987f3d3e60e17b5844c32928df667b547ae4.png

It seems that we successfully dropped those null rows, and even generated some marginal plots for quick at a glance summaries! Obviously this type of plotting would be more meaningful for a larger dataset, but I wanted to familiarize you with some of the way that you can investigate a dataset visually.

After inspecting our data for nulls, one might want to investigate the cause of null values by inspecting their relationship. Do any features have high null correlation? If so, what was the data generation process that created those values?

Since our dataset only has one NaN value remaining, we should instead focus on the values we have:

Visualizing Numeric Data#

from matplotlib.dates import DateFormatter
from math import ceil
ncols = 2

numeric_data = parsed_df.set_index('Time').select_dtypes('number').sort_index()

fig, axes = subplots(
    ceil(numeric_data.columns.size / ncols), ncols, figsize=(12, 6),
    sharex=True, sharey=False, gridspec_kw={'top': .85, 'hspace': 1},
)

for i, (ax, col) in enumerate(zip(axes.flat, numeric_data.columns)):
    s = numeric_data[col]
    ax.plot(s.index, s)
    ax.set_title(col)

    
for ax in axes[-1, :]:
    ax.set_xlabel('Hour of Day')
    ax.xaxis.set_major_formatter(DateFormatter('%I'))
    
for ax in axes.flat[i+1:]:
    ax.yaxis.set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.set_xlabel('Hour of Day')

fig.suptitle('Numeric Dtype Columns From Weather Data');

../_images/0fca8942528f7946f6047e6f0a63cc5090bf32e1362c834e7ef415d21dd30a39.png

After investigating the timeseries data above, it seems that we have a few odd readings on Temperature, Dew Point, and Humidity. I’m no climate scientist, but I’ve never seen outdoor temperatures drop from ~30-40 degrees F down to 0, then back up again in a matter of minutes.

Considering that this data is not a primary source, I’m going to assume that these values were filled in with 0s by some other piece of automation. For my investigation, these should probably be replaced with NaN values. This allows me to either interpolate them or deal with them in another manner.

from numpy import nan

replacements = {
    'Temperature (°F)': {0: nan},
    'Dew Point (°F)' : {0: nan},
    'Humidity (°%)': {0: nan},
}
interpolated_data = (
    numeric_data.replace(replacements)
    .interpolate(method='time') # our index is time-aware
)

fig, axes = subplots(
    ceil(interpolated_data.columns.size / ncols), ncols, figsize=(12, 6),
    sharex=True, sharey=False, gridspec_kw={'top': .85, 'hspace': 1},
)

for i, (ax, col) in enumerate(zip(axes.flat, interpolated_data.columns)):
    s = interpolated_data[col]
    ax.plot(s.index, s)
    ax.set_title(col)

    
for ax in axes[-1, :]:
    ax.set_xlabel('Hour of Day')
    ax.xaxis.set_major_formatter(DateFormatter('%I'))
    
for ax in axes.flat[i+1:]:
    ax.yaxis.set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.set_xlabel('Hour of Day')

fig.suptitle('Numeric Dtype Columns From Weather Data');

../_images/44ca4e2b29ad5fdd3f7953201b1296efd44a38725862e0c07d2c6aaf91f1bca4.png

Wrap-up#

That’s it for this week! I hoped you enjoyed this quick dive into working with slightly messy data.

If you’d like to dig a little deeper into this topic, join us for our upcoming seminar, “Spot the Lies Told with this Data” which I’ll be hosting on March 11th!

And, being a reader of our blog, use the link below to save 10% off of any ticket tier! “Spot the Lies Told with this Data,”

Talk to you again next week!

Dealing With Dates in pandas - Part 3 What the Index?

Recent Posts

Tags