Co-Occurrence Considerations in pandas#
Data come in a wide variety of shapes and sizes, and, as it turns out, analysts spend far more time massaging and cleaning data than actually working with any particular models. The breadth of structures, shapes, and sizes of data is incredibly hard to predict: you never know what type of dataset will come across your desk when your boss asks you to “take a look” at it.
Even more importantly, the shape of the originating data strongly guides the approaches we take to analyze it. If the dataset is small, we may be encouraged to reshape it into something more analysis-ready. If it is large, however, we may abandon the thought of reshaping and attempt to analyze it as directly as possible, leveraging some awkward tricks along the way.
This week, I came across this question on Stack Overflow that inspired me to write on this topic. The original question is a fairly straightforward co-occurrence problem, but let’s take a look at how the shape and potential size of the data factor into our approach. Let’s dive in!
The Data#
Here, we’ll only look at a small dataset. The author of the question on SO wanted a co-occurrence count of each item against each number, ignoring which `N?` column each number came from. Since we want to ignore the existing columns, we will either need to treat all of the `N?` columns uniformly (while respecting their originating rows) and combine the results, or we will need to reshape the data and then perform our co-occurrence counts. The latter approach is much simpler than the former, so that is the one we’ll pursue (though we’ll sketch the former below, right after loading the data).
from io import StringIO
from textwrap import dedent

# the small, wide-format dataset from the Stack Overflow question
buffer = StringIO(dedent('''
Item N1 N2 N3 N4
Item1 1 2 4 8
Item2 2 3 6 7
Item3 4 5 7 9
Item4 1 5 6 7
Item5 3 4 7 8
'''))
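Before we move on, here is a minimal sketch of what that former approach might look like: count each `N?` column against `Item` separately, then combine the per-column results (the `df_` and `pairs` names are ours, not part of the original question).
from pandas import concat, read_table

df_ = read_table(buffer, sep=r'\s+')  # whitespace-delimited data
buffer.seek(0)                        # rewind so the buffer can be re-read below

# pair each N? column with Item under a common name, then count co-occurrences
pairs = concat(
    df_[['Item', c]].rename(columns={c: 'value'})
    for c in ['N1', 'N2', 'N3', 'N4']
)
pairs.value_counts(['Item', 'value']).sort_index().head(8)
Of course, this per-column concat is just a hand-rolled melt, which is exactly why the reshaping route ends up simpler.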
Pandas - Reshape & Count#
Broadly, there are two reshaping operations we use in tabular data analysis:

- `melt`/unpivot → turn columns into rows
- `pivot` → turn rows into columns
In this case, we are going to want to `melt` our data so that we have each number paired next to the item it originated from.

The pandas answer is quite concise. Here, we `melt` the data to reshape it from wide to long format, stacking each item’s values from the `N?` columns into rows.
from pandas import read_table, crosstab

df = read_table(buffer, sep=r'\s+')  # whitespace-delimited data
melted = df.melt('Item')             # long format: one row per (Item, value) pair
melted.head()
|   | Item  | variable | value |
|---|-------|----------|-------|
| 0 | Item1 | N1       | 1     |
| 1 | Item2 | N1       | 2     |
| 2 | Item3 | N1       | 4     |
| 3 | Item4 | N1       | 1     |
| 4 | Item5 | N1       | 3     |
Now all we need to do is create a wide-formatted DataFrame so that we end up with Items as our rows, numbers as our columns, and the number of times they co-occurred as the values. In pandas, this is accomplished with `pandas.crosstab`.
Crosstab: wide result#
(
    crosstab(melted['Item'], melted['value'])  # rows: Item, columns: value, cells: counts
    .rename_axis(index=None, columns=None)     # drop axis names for a cleaner display
)
|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------|---|---|---|---|---|---|---|---|---|
| Item1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Item2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| Item3 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| Item4 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Item5 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
Notice that our output is now back in a wide format. If we are going to feed these data into a statistical model, then this output is likely useful. However, if we want to continue working with these values tabularly, we may decide to keep things in a long format. We could take this result and `melt` it again (though we should opt to `.stack` these data instead to preserve our good use of the row/column index).
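For the curious, here is a minimal sketch of that `.stack` route (the `wide` and `long_counts` names are ours):
# stacking collapses the crosstab's columns into a second index level,
# yielding a long Series of counts; note that the 0s survive the round
# trip, since the wide table is dense
wide = crosstab(melted['Item'], melted['value'])
long_counts = wide.stack()  # MultiIndex of (Item, value) → count
long_counts.head(8)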
Value Counts: long result#
Instead of paying to reshape the data again, we should simply use a more appropriate tool. This is no more complex than a simple `.value_counts`!
melted.value_counts(['Item', 'value']).sort_index().head(8)
Item value
Item1 1 1
2 1
4 1
8 1
Item2 2 1
3 1
6 1
7 1
Name: count, dtype: int64
Easy, right? But what if we really want the explicit `0`s for any non-observed co-occurrence? See how our matrix-like output differs from the result we just obtained?
Counting Categoricals: long/sparse result#
To do this, we can convert the columns we’re counting into Categoricals, which support reporting “unobserved” categories. However, to leverage this feature, we will need to use a `.groupby` instead of a `.value_counts`.
(
    melted.astype({'Item': 'category', 'value': 'category'})
    .groupby(['Item', 'value'], observed=False)  # report unobserved category pairs
    .size()
).head(6)  # the full output is quite long
Item value
Item1 1 1
2 1
3 0
4 1
5 0
6 0
dtype: int64
Count & Reindex: long/sparse result#
But, if our columns aren’t already Categoricals (or if they’re already numeric), then we could be adding overhead by performing this conversion. Instead, we may want to perform the same dense `.value_counts` and align the result against a spine.
result = melted.value_counts(['Item', 'value'])
result.head(8)
Item value
Item1 1 1
2 1
Item5 7 1
4 1
3 1
Item4 7 1
6 1
5 1
Name: count, dtype: int64
To transition from this long/dense format, we will need to create a new index whose codes are the Cartesian product of all levels. Then we need to align this result against that new index and fill any missing values in with 0. Thankfully, pandas makes this quite easy to do with the `MultiIndex.from_product` constructor and `.reindex`!
from pandas import MultiIndex

result.reindex(
    MultiIndex.from_product(result.index.levels),  # the full (Item, value) spine
    fill_value=0,                                  # explicit 0 for unobserved pairs
).head(8)
Item value
Item1 1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
Name: count, dtype: int64
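As a quick sanity check (an aside using the names defined above), this long/sparse result round-trips back to the wide crosstab result with a single `.unstack`:
# unstacking the zero-filled long result reproduces the wide co-occurrence
# matrix we built with crosstab earlier
sparse = result.reindex(MultiIndex.from_product(result.index.levels), fill_value=0)
sparse.unstack('value')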
Wrap-Up#
Whew! A simple problem, but we have four different answers all in pandas. While none of these solutions is too complex in its own right, each has a time and place depending on how the result will be used downstream.
Data comes in all shapes, sizes, and flavors—and, as much as we’d love to jump straight into analysis, reshaping and preparing it is often the real work. Each dataset has its quirks, and the choices we make depend on its size, structure, and the type of analysis we want to run. For smaller datasets, we can afford to reshape them all we like. But for larger ones, we sometimes need to get creative and find efficient ways to wrangle them without a massive performance hit.
As we saw here, there’s rarely one “right” way to solve a data reshaping problem. Whether you’re a pandas enthusiast or experimenting with Polars, understanding the strengths of each tool can save you time and effort. Next time you’re handed a dataset that needs some transformation magic, remember that the tools you choose, and the strategies you employ, are all part of the art of data science.
Polars - Unpivot & Count#
For comparison, here is the same long-format count in Polars, whose `unpivot` plays the role of pandas’ `melt`.
from polars import from_pandas

pl_df = from_pandas(df)                  # convert the pandas DataFrame to Polars
unpivoted = pl_df.unpivot(index="Item")  # wide → long, keeping Item as the identifier
unpivoted.group_by(['Item', 'value']).len().sort('Item', 'value')
| Item    | value | len |
|---------|-------|-----|
| str     | i64   | u32 |
| "Item1" | 1     | 1   |
| "Item1" | 2     | 1   |
| "Item1" | 4     | 1   |
| "Item1" | 8     | 1   |
| "Item2" | 2     | 1   |
| …       | …     | …   |
| "Item4" | 7     | 1   |
| "Item5" | 3     | 1   |
| "Item5" | 4     | 1   |
| "Item5" | 7     | 1   |
| "Item5" | 8     | 1   |
What do you think about my approach? Is there something you would do differently? Let me know on the DUTC Discord server.
Talk to you all next week!