Intentional Visualizations
Hello, everyone! This week, I want to discuss the often-overlooked exploratory charts.
I often speak to a dichotomy of purposes whenever I discuss data visualization. These purposes are designed to help organize our thoughts about both why and how we should visualize our data in the first place. The reasons one might reach for a visualization are:
Exploratory
Communicative
Exploratory charts explore data, often with only a minimal theory or hypothesis in mind. A communicative chart will always have some theory behind it. Therefore, it should be much more intentional and be designed to convey a specific message about the data that are displayed.
I say these charts are often overlooked because I often see exploration raised in defense of poor chart choice and design. Just because we are exploring data does not mean we should attempt to visualize all of the data in any form. Even though we may not have a specific hypothesis or theory in mind, we should not lose sight of this fact:
Exploratory charts should facilitate the exploration of data.
Let’s take a look at an example from some synthetic data.
Data
from numpy.random import default_rng
from pandas import DataFrame, date_range

rng = default_rng(0)
df = DataFrame(
    index=(dates := date_range('2000-01-01', freq='MS', periods=12)),
    data={
        'north': rng.normal(10_000, scale=1_000, size=dates.size),
        'east':  rng.normal(11_000, scale=3_000, size=dates.size),
        'south': rng.normal(12_000, scale=5_000, size=dates.size),
        'west':  rng.normal( 8_000, scale=2_000, size=dates.size),
    }
).round(2)

df.head()
| | north | east | south | west |
|---|---|---|---|---|
| 2000-01-01 | 10125.73 | 4024.91 | 16517.35 | 6692.34 |
| 2000-02-01 | 9867.90 | 10343.63 | 12470.06 | 7740.77 |
| 2000-03-01 | 10640.42 | 7262.27 | 8282.50 | 9567.95 |
| 2000-04-01 | 10104.90 | 8803.20 | 7391.37 | 10986.86 |
| 2000-05-01 | 9464.33 | 9367.22 | 9711.37 | 5481.87 |
Just get it on the screen!
“Just get it on the screen” is the worst mindset we can take when creating a chart. Exploration involves the visual search for patterns (e.g., do values tend to flow in a specific direction; across what axis or grouping does this occur?).
Let’s create a chart with little to no thought:
%matplotlib inline
from matplotlib.pyplot import rc
rc('font', size=20)
rc('figure', facecolor='white', dpi=100)
ax = df.plot.bar(figsize=(18, 6), legend=False)
ax.set_xticklabels(df.index.strftime('%B'))
ax.xaxis.set_tick_params(rotation=0)
ax.legend(loc='upper right', bbox_to_anchor=(1, 1), ncols=4)
ax.yaxis.set_major_formatter(lambda x, pos: f'${x/1000:g}k')
ax.set_title('Revenue by Region over Months (2000)', loc='left', size='x-large');
While the above chart does visualize the data, we have no justification for using a clustered bar chart in the first place. But let’s take a moment to think about this. What comparisons does a clustered bar chart facilitate?
Look closely at the chart to try to perform two comparisons:
Compare one region to another within a given month
Compare the same region from one month to another
The chart we have selected strongly favors the former over the latter. While we can perform cross-month comparisons here, it is tedious. In fact, I would even argue that the use of a legend makes the within-month comparison harder (you need to repeatedly reference the legend to see which region corresponds to which color).
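The two comparisons map onto the two axes of `df`, so one rough way to see what a chart should make easy is to compute each comparison directly. This is only a minimal sketch that rebuilds the synthetic data from above; the names `top_region` and `monthly_change` are my own, introduced purely for illustration.

```python
from numpy.random import default_rng
from pandas import DataFrame, date_range

rng = default_rng(0)
df = DataFrame(
    index=(dates := date_range('2000-01-01', freq='MS', periods=12)),
    data={
        'north': rng.normal(10_000, scale=1_000, size=dates.size),
        'east':  rng.normal(11_000, scale=3_000, size=dates.size),
        'south': rng.normal(12_000, scale=5_000, size=dates.size),
        'west':  rng.normal( 8_000, scale=2_000, size=dates.size),
    }
).round(2)

# within-month comparison: which region leads in each month? (across columns)
top_region = df.idxmax(axis='columns')

# cross-month comparison: how did each region change from the previous month? (down rows)
monthly_change = df.diff()

print(top_region.head())
print(monthly_change.head().round(2))
```

A chart that favors the first comparison should make `top_region` readable at a glance; one that favors the second should make the sign and size of `monthly_change` easy to trace.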
Let’s redo this chart and strengthen the comparison of one region to another within a given month, which is what we want to facilitate.
Intentionality: Cross-Region, Within-Month
A few things need to happen to facilitate the comparison of interest here:
Remove the legend and directly label our regions
Prevent the cross-month comparison from competing with the within-month one
The simplest way to accommodate these goals is to create a bar chart via small multiples instead of cramming all comparisons onto the same set of Axes. With multiple Axes, we can directly label our regions and reduce interference from competing comparisons. Let’s take a look at how we can accomplish this via Matplotlib:
from numpy import where
from matplotlib.pyplot import subplots
from matplotlib.ticker import MultipleLocator

fig, axes = subplots(
    3, 4, figsize=(18, 9),
    layout='constrained',
    sharey=True, sharex=True,
    gridspec_kw={'hspace': .2, 'wspace': .1},
)

for ax, (ts, row) in zip(axes.flat, df.iterrows()):
    bc = ax.bar(row.index, row)
    for rect in bc:
        if rect.get_height() < row.max():
            rect.set_alpha(.5)
    ax.set_title(ts.strftime('%B'), loc='left', size='x-large')
    ax.xaxis.set_tick_params(labelbottom=True, bottom=False)
    ax.yaxis.set_tick_params(labelleft=True, left=False)
    ax.spines[['top', 'left', 'right']].set_visible(False)

# options are shared since y Axes are shared
ax.yaxis.set_major_locator(MultipleLocator(5000))
ax.yaxis.set_major_formatter(lambda x, pos: f'${x/1000:g}k')

fig.canvas.draw()
bbox = axes.flat[0]._left_title.get_tightbbox()
ax.figure.text(
    s='Revenue by Region over Months (2000)',
    x=bbox.x0,
    y=bbox.y1 + 10,
    ha='left',
    va='bottom',
    size='xx-large',
    transform=None,
);
Notice how easy it is to compare the revenue generated by each region within a given month. This layout also let us include some supplemental information, highlighting the maximum within each month, at no extra cost. This is a chart that doesn’t just “show the data”; it facilitates exploration.
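The highlighting of each month's maximum is done by checking every bar's height in a loop. As an alternative sketch (not what the code above does), the same alpha values could be computed up front with `numpy.where`. The `row` Series here is a hypothetical single month, with values copied from the first row of the table shown earlier.

```python
from numpy import where
from pandas import Series

# hypothetical single month of revenue, one value per region
row = Series({'north': 10125.73, 'east': 4024.91, 'south': 16517.35, 'west': 6692.34})

# full opacity for the maximum, half opacity for everything else
alphas = where(row == row.max(), 1.0, 0.5)

for region, alpha in zip(row.index, alphas):
    print(region, alpha)
```

Each value could then be applied with `rect.set_alpha(alpha)` while zipping over the bars returned by `ax.bar`.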
But what if we wanted to focus on the opposite comparison?
Intentionality: Cross-Month, Within-Region
Continuing from the previous section, I want to showcase how to facilitate exploration for the inverted question: “How do we better explore within a region, across the months?”
Since the focus now is on comparing datapoints within a region, we’re going to use a line chart. We can use DataFrame.plot to create the simplest view possible.
ax = df.plot(marker='o', figsize=(16, 6), ls='--', legend=False)
ax.legend(loc='upper right', ncol=4, bbox_to_anchor=(1, 1))
ax.yaxis.set_major_formatter(lambda x, pos: f'${x/1_000:g}k')
ax.set_xlim(auto=True)
ax.margins(x=.01)
ax.set_title('Revenue by Region over Months (2000)', loc='left', size='x-large');
While this adequately plots the data, it’s not particularly inspiring. The reason I dislike this specific chart is that many of these lines occlude one another, and the use of color is a minor nuisance because it becomes visually challenging to follow a single line. This points to an issue directly related to superimposing the regions on top of one another. Instead of superimposing along this dimension, we can try juxtaposing it.
axes = df.plot(
    marker='o', subplots=True, layout=(2, 2), figsize=(20, 6),
    sharey=True, sharex=True, legend=False,
)
fig = axes.flat[0].figure

for ax, colname in zip(axes.flat, df.columns):
    ax.yaxis.set_major_formatter(lambda x, pos: f'${x/1_000:g}k')
    ax.set_xlim(auto=True)
    ax.margins(x=.01)
    ax.set_title(colname, loc='left')
    ax.yaxis.set_tick_params(labelleft=True)

axes.flat[0].annotate(
    'Revenue by Region over Months (2000)',
    xy=(0, 1), xycoords=(axes.flat[0]._left_title),
    xytext=(0, 10), textcoords='offset points',
    size='xx-large',
)
fig.tight_layout()
This chart allows us to easily trace the change in monthly revenue from each region. From here, we can inspect for periodicity or outliers within a region. In this case, we have abandoned the cross-region comparisons since we now need to compare across charts. This might seem fine for our purposes, but the other regions provide something important: context. The previous approach, wherein we used superimposition, communicated the context of the data, answering questions like “Does any one region ‘stand out’ compared to the rest?”
Our previous chart accomplished this but made it hard to focus on any single line. Instead, we can reach for a juxtaposition trick called “small multiples” to provide context for each line. Effectively, we are going to recreate the above chart, adding in background lines representing the rest of the data. This allows us to view the trends within a single region while also comparing it to the rest of the regions’ data (without specificity).
from matplotlib.dates import DateFormatter, MonthLocator

rc('axes.spines', top=False, right=False, left=False)

fig, axes = subplots(
    2, 2, figsize=(20, 6),
    sharex=True, sharey=True,
    layout='constrained',
    gridspec_kw={'hspace': .1},
)

for (name, s), ax in zip(df.items(), axes.flat):
    context_df = df.drop(columns=[name])
    ax.plot(context_df.index, context_df, marker='o', color='gainsboro')
    ax.plot(s.index, s, marker='o', lw=2, ms=8, ls='--')
    ax.set_title(name.title(), loc='left', size='x-large')
    ax.yaxis.set_tick_params(labelleft=True)
    ax.yaxis.set_major_formatter(lambda x, pos: f'${x/1000:g}k')
    ax.xaxis.set_major_locator(MonthLocator())
    ax.xaxis.set_major_formatter(DateFormatter('%b'))
    ax.margins(x=.02)

fig.canvas.draw()
title_bbox = axes.flat[0]._left_title.get_tightbbox()
axes_bbox = axes.flat[0].get_tightbbox()
fig.text(
    s='Revenue by Region over Months (2000)',
    x=axes_bbox.x0,
    y=title_bbox.y1 + 10,
    ha='left',
    va='bottom',
    size='xx-large',
    transform=None,
);
There we have it: a chart that appropriately facilitates comparisons within each region while also providing contextual data points. This chart helps identify any trends within each region while also making obvious any outliers that exist across regions (within a given month).
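If you find yourself reaching for this pattern often (every other series drawn in a muted color, then the series of interest on top), it could be wrapped in a small helper. This is only a sketch: `plot_with_context` and its parameters are names I made up, not part of Matplotlib or pandas, and the `demo` data is arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display

from matplotlib.pyplot import subplots
from pandas import DataFrame, date_range

def plot_with_context(ax, df, name, context_color='gainsboro'):
    """Draw column `name` in focus, with all other columns muted behind it."""
    context = df.drop(columns=[name])
    # background lines: one per remaining column, all in the same muted color
    ax.plot(context.index, context.to_numpy(), color=context_color, zorder=1)
    # foreground line: the series we want the reader to trace
    (focus,) = ax.plot(df.index, df[name], marker='o', lw=2, zorder=2)
    ax.set_title(name.title(), loc='left')
    return focus

# arbitrary demo data, not the revenue data from this post
demo = DataFrame(
    index=date_range('2000-01-01', freq='MS', periods=4),
    data={'north': [1, 2, 3, 4], 'east': [4, 3, 2, 1], 'south': [2, 2, 3, 3]},
)

fig, axes = subplots(1, 3, sharex=True, sharey=True, figsize=(12, 3))
for ax, name in zip(axes.flat, demo.columns):
    plot_with_context(ax, demo, name)
```

Passing the Axes in explicitly keeps the helper independent of how the grid of subplots is created, so the same function works with any layout.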
Wrap-Up
Next time you work on some data visualization, don’t just “throw data on the screen”; make something obvious and meaningful with real intention behind it.
What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.
Talk to you all next week!