Co-Occurrence Considerations in pandas

Data come in a wide variety of shapes and sizes, and, as it turns out, analysts spend much more time massaging and cleaning data than actually working with any particular model. The breadth of structures, shapes, and sizes of data is incredibly hard to predict—you never know what type of dataset will come across your desk when your boss asks you to “take a look” at it.

Even more importantly, the shape of the originating data strongly guides the approaches we take to analyze it. If it is a small dataset, we may be encouraged to reshape it into something more analytically ready. However, if the dataset is large, we may abandon the thought of reshaping and attempt to analyze it as directly as possible, leveraging some awkward tricks along the way.

This week, I came across this question on Stack Overflow that inspired me to write on this topic. The original question is a fairly straightforward co-occurrence problem, but let’s take a look at how the shape of the data and potential dataset sizes come into consideration. Let’s dive in!

The Data

Here, we’ll only look at a small dataset. The author of the question on SO wanted a co-occurrence count of each item against each number, ignoring the existing columns entirely. Since the columns themselves don’t matter, we either need to treat all of the "N?" columns uniformly (while respecting their originating rows) and combine the results, or we need to reshape the data and then perform our co-occurrence counts. The latter approach is much simpler than the former, so let’s take a look at that first (a quick sketch of the column-wise approach follows the data below, for comparison).

from textwrap import dedent
from io import StringIO

buffer = StringIO(dedent('''
    Item	N1	N2	N3	N4
    Item1	1	2	4	8
    Item2	2	3	6	7
    Item3	4	5	7	9
    Item4	1	5	6	7
    Item5	3	4	7	8
'''))
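For comparison, here is a minimal sketch of the column-wise approach we are setting aside: cross-tabulate Item against each "N?" column separately, then add the aligned tables together. The names (wide_df, per_column, combined) and the reduce over .add are my own illustration, not part of the original question.

from functools import reduce
from pandas import read_table, crosstab

# one crosstab per "N?" column, all sharing Item as the row label
wide_df = read_table(buffer)
per_column = [crosstab(wide_df['Item'], wide_df[c]) for c in ['N1', 'N2', 'N3', 'N4']]

# sum the aligned tables; fill_value=0 covers values absent from a given column
combined = reduce(lambda acc, tab: acc.add(tab, fill_value=0), per_column).astype(int)

buffer.seek(0)  # rewind so the next section can read the same buffer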

Pandas - Reshape & Count

Broadly, there are two reshaping operations we use in tabular data analysis.

  1. melt/unpivot → turn columns into rows

  2. pivot → turn rows into columns

In this case, we are going to want to melt our data so that we have each number paired next to the item it originated from.
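As a quick refresher, here is a toy round trip between the two operations (the frame and column names are purely illustrative):

from pandas import DataFrame

# toy wide frame → long via melt → back to wide via pivot
toy = DataFrame({'key': ['a', 'b'], 'x': [1, 2], 'y': [3, 4]})
long_toy = toy.melt('key')                                                   # columns become rows
wide_toy = long_toy.pivot(index='key', columns='variable', values='value')  # rows become columns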

The pandas answer is quite concise. Here, we melt the data to reshape it from wide to long format, so that each value sits in its own row, paired with the Item it came from.

from pandas import read_table, crosstab

df = read_table(buffer)   # the sample is tab-separated, read_table's default
melted = df.melt('Item')  # wide → long: one row per (Item, N?) value
melted.head()
    Item variable  value
0  Item1       N1      1
1  Item2       N1      2
2  Item3       N1      4
3  Item4       N1      1
4  Item5       N1      3

Now all we need to do is create a wide-formatted DataFrame so that we end up with Items as our rows, the numbers as our columns, and the number of times they co-occurred as the values. In pandas, this is accomplished with pandas.crosstab.

Crosstab: wide result

(
    crosstab(melted['Item'], melted['value'])
    .rename_axis(index=None, columns=None)
)
       1  2  3  4  5  6  7  8  9
Item1  1  1  0  1  0  0  0  1  0
Item2  0  1  1  0  0  1  1  0  0
Item3  0  0  0  1  1  0  1  0  1
Item4  1  0  0  0  1  1  1  0  0
Item5  0  0  1  1  0  0  1  1  0

But notice that our output is now back in a wide format. If we are going to feed these data into a statistical model, then this output is likely useful. However, if we want to continue working with these values tabularly, we may decide to keep things in a long format. We can take this result and melt it again (though we should opt to .stack these data instead, to preserve our good use of the row/column index).
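For instance, here is a minimal sketch of the .stack route, reusing the crosstab result from above (the intermediate names are mine):

# stack the wide co-occurrence table back into a long Series,
# keeping (Item, value) as a proper MultiIndex and preserving the 0s
wide_counts = crosstab(melted['Item'], melted['value'])
long_counts = wide_counts.stack()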

Value Counts: long result

Instead of paying to reshape the data again, we should simply use a more appropriate solution. This is no more complex than a simple .value_counts!

melted.value_counts(['Item', 'value']).sort_index().head(8)
Item   value
Item1  1        1
       2        1
       4        1
       8        1
Item2  2        1
       3        1
       6        1
       7        1
Name: count, dtype: int64

Easy, right? But what if we want explicit 0s for every non-observed co-occurrence? Notice how the matrix-like output above differs from the result we just obtained.

Counting Categoricals: long/sparse result

To do this, we can convert the columns we’re counting into Categoricals, which have support for reporting “unobserved” categories. However, to leverage this feature, we will need to use a .groupby instead of .value_counts.

(
    melted.astype({'Item': 'category', 'value': 'category'})
    .groupby(['Item', 'value'], observed=False)
    .size()
).head(6) # the full output is quite long
Item   value
Item1  1        1
       2        1
       3        0
       4        1
       5        0
       6        0
dtype: int64

Count & Reindex: long/sparse result

But, if our columns aren’t already Categoricals—or if they’re already numeric—then we could be adding overhead by performing this conversion. Instead, we may want to perform the same .value_counts operation and align the dense result against a spine.

result = melted.value_counts(['Item', 'value'])
result.head(8)
Item   value
Item1  1        1
       2        1
Item5  7        1
       4        1
       3        1
Item4  7        1
       6        1
       5        1
Name: count, dtype: int64

To transition from this long/dense format, we need to create a new index whose entries are the Cartesian product of all of the level values. Then we align this result against that new index and fill any missing values with 0. Thankfully, pandas makes this quite easy to do with the MultiIndex.from_product constructor and .reindex!

from pandas import MultiIndex

result.reindex(
    MultiIndex.from_product(result.index.levels),
    fill_value=0,
).head(8)
Item   value
Item1  1        1
       2        1
       3        0
       4        1
       5        0
       6        0
       7        0
       8        1
Name: count, dtype: int64

Wrap-Up

Whew! A simple problem, but we have four different answers, all in pandas. While none of these solutions is too complex in its own right, each has a time and place depending on how the result will be used downstream.

Data comes in all shapes, sizes, and flavors—and, as much as we’d love to jump straight into analysis, reshaping and preparing it is often the real work. Each dataset has its quirks, and the choices we make depend on its size, structure, and the type of analysis we want to run. For smaller datasets, we can afford to reshape them all we like. But for larger ones, we sometimes need to get creative and find efficient ways to wrangle them without a massive performance hit.

As we saw here, there’s rarely one “right” way to solve a data reshaping problem. Whether you’re a pandas enthusiast or experimenting with Polars, understanding the strengths of each tool can save you time and effort. Next time you’re handed a dataset that needs some transformation magic, remember that the tools you choose, and the strategies you employ, are all part of the art of data science.
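Speaking of Polars, here is the same unpivot-and-count approach expressed with its API: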

from polars import from_pandas

# convert the pandas DataFrame, then repeat the unpivot → group_by → count approach
df = from_pandas(df)
unpivoted = df.unpivot(index="Item")
unpivoted.group_by(['Item', 'value']).len().sort('Item', 'value')
shape: (20, 3)
┌─────────┬───────┬─────┐
│ Item    ┆ value ┆ len │
│ ---     ┆ ---   ┆ --- │
│ str     ┆ i64   ┆ u32 │
╞═════════╪═══════╪═════╡
│ "Item1" ┆ 1     ┆ 1   │
│ "Item1" ┆ 2     ┆ 1   │
│ "Item1" ┆ 4     ┆ 1   │
│ "Item1" ┆ 8     ┆ 1   │
│ "Item2" ┆ 2     ┆ 1   │
│ …       ┆ …     ┆ …   │
│ "Item4" ┆ 7     ┆ 1   │
│ "Item5" ┆ 3     ┆ 1   │
│ "Item5" ┆ 4     ┆ 1   │
│ "Item5" ┆ 7     ┆ 1   │
│ "Item5" ┆ 8     ┆ 1   │
└─────────┴───────┴─────┘

What do you think about my approach? Is there something you would do differently? Let me know on the DUTC Discord server.

Talk to you all next week!