When the .index is convenient#

The blazingly fast DataFrame library Polars has one huge conceptual difference from the DataFrame veteran pandas: pandas is all about working against a shared index, whereas Polars asks you to align data explicitly, typically with joins.

I came across a question on Stack Overflow that provided a great example of the benefits of working in an index-aligned way.

The Question#

I have found how to count the values of a specified column, but I can’t find a way to do the same thing for the entire dataframe. I could count the values of each column and then combine them afterward, but it is a bit clunky and I was looking for a more elegant solution.

While the original post had no minimal reproducible example, here is one created on its behalf.

input:

| col_1 | col_2 | col_3 |
|-------|-------|-------|
| "a"   | "b"   | "a"   |
| "b"   | "c"   | "a"   |
| "c"   | "d"   | "e"   |
| "a"   | "b"   | "f"   |

expected output:

| value | count |
|-------|-------|
| "a"   | 4     |
| "b"   | 3     |
| "c"   | 2     |
| "d"   | 1     |
| "e"   | 1     |
| "f"   | 1     |

Essentially, we want to take the value counts across each column and combine those results.
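Before reaching for either library, the target result can be sketched with the standard library's Counter (the `data` literal simply mirrors the example frame above):

```python
from collections import Counter

# The example frame, as plain columns
data = {
    "col_1": ["a", "b", "c", "a"],
    "col_2": ["b", "c", "d", "b"],
    "col_3": ["a", "a", "e", "f"],
}

# Tally every cell across all columns
counts = Counter()
for column in data.values():
    counts.update(column)

print(counts.most_common())
# [('a', 4), ('b', 3), ('c', 2), ('d', 1), ('e', 1), ('f', 1)]
```

This is the "count each column, then combine" approach the asker called clunky; the DataFrame solutions below express the same idea more idiomatically.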

Polars - Tidy Data#

The simplest approach in Polars requires reshaping the data. An unpivot operation (what Polars used to call melt) takes our wide-format data and transforms it so that we have a single observation per row that we can operate on.

import polars as pl
from polars import col

pl_df = pl.DataFrame({
    "col_1": [*'abca'],
    "col_2": [*'bcdb'],
    "col_3": [*'aaef'],
})

pl_df.unpivot()['value'].value_counts().sort('count', descending=True)
shape: (6, 2)

| value (str) | count (u32) |
|-------------|-------------|
| "a"         | 4           |
| "b"         | 3           |
| "c"         | 2           |
| "f"         | 1           |
| "e"         | 1           |
| "d"         | 1           |

While this approach is convenient, reshaping the data is often considered an expensive operation. So, I wanted to come up with a way to avoid it if possible.
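To see why the reshape is costly, here is a rough plain-Python sketch of what a wide-to-long transform does: every cell is copied into its own (variable, value) row before anything gets counted (the `data` literal mirrors the example frame; this is an illustration, not the Polars implementation):

```python
# Rough sketch of a wide-to-long reshape: each cell of an R x C frame
# becomes its own (variable, value) row, i.e. R * C rows in total.
data = {
    "col_1": ["a", "b", "c", "a"],
    "col_2": ["b", "c", "d", "b"],
    "col_3": ["a", "a", "e", "f"],
}

long_rows = [
    (name, value)
    for name, column in data.items()
    for value in column
]

print(len(long_rows))   # 12 rows from a 4 x 3 frame
print(long_rows[:3])    # [('col_1', 'a'), ('col_1', 'b'), ('col_1', 'c')]
```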

Polars - Untidy Data#

For the untidy approach, we need to perform a .value_counts() on each of the columns individually, align each of those results, and sum across their counts.

While this feels awkward, being explicit here is what lets us avoid paying the cost of reshaping the data. Without the implicit index alignment we have in pandas, we need to align the parts ourselves, hence the call to pl.concat(parts, how='align').

parts = []
for c in pl_df.columns:
    parts.append(
        pl_df.select(col(c).value_counts()).unnest(c)
        .rename({c: 'value', 'count': c})
    )

(
    pl.concat(parts, how='align')
    .select('value', count=pl.sum_horizontal(pl_df.columns))
)
shape: (6, 2)

| value (str) | count (u32) |
|-------------|-------------|
| "a"         | 4           |
| "b"         | 3           |
| "c"         | 2           |
| "d"         | 1           |
| "e"         | 1           |
| "f"         | 1           |

pandas#

With pandas, we do not actually need to perform an explicit concatenation. Instead, we can rely on the index-aligned result of DataFrame.apply(...), which applies a function to each column (not to be confused with DataFrame.apply(..., axis=1), which applies the function to each row).

Since the results are aligned against their shared index, we can simply sum horizontally across them for the final result.

pd_df = pl_df.to_pandas()

pd_df.apply(lambda s: s.value_counts()).sum(axis=1)
a    4.0
b    3.0
c    2.0
d    1.0
e    1.0
f    1.0
dtype: float64
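One cosmetic wrinkle: values missing from a column become NaN after alignment, which upcasts the counts to float64. A sketch of one way to restore integer counts, filling the gaps with 0 before summing (the frame literal mirrors the example data):

```python
import pandas as pd

pd_df = pd.DataFrame({
    "col_1": ["a", "b", "c", "a"],
    "col_2": ["b", "c", "d", "b"],
    "col_3": ["a", "a", "e", "f"],
})

# fillna(0) plugs the alignment gaps so the sum stays integer-typed
result = (
    pd_df.apply(lambda s: s.value_counts())
    .fillna(0)
    .astype(int)
    .sum(axis=1)
)
print(result)
```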

Wow. That implicit index alignment really goes a long way toward writing concise code!

Wrap-Up#

There you have it! While this is one of the cases where the pandas.Index becomes very helpful, I know many of us fight with it. Want to know more about the index? Join us on Discord!