# When the .index is convenient
The blazingly fast DataFrame library, Polars, has a huge conceptual difference from the DataFrame veteran, pandas: pandas is *all* about working with a consistent index, whereas Polars forces you to work more explicitly, using `join`s.
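To make that contrast concrete, here is a minimal sketch of my own (not from the original question): pandas aligns arithmetic on index labels automatically, while Polars needs the same alignment spelled out as a join on a key column.

```python
import pandas as pd
import polars as pl

# pandas: addition aligns on the shared index labels, not on position.
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "a"])
print(s1 + s2)  # a -> 21, b -> 12; aligned by label

# Polars: there is no index, so the alignment is an explicit join.
left = pl.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pl.DataFrame({"key": ["b", "a"], "y": [10, 20]})
print(
    left.join(right, on="key")
        .select("key", total=pl.col("x") + pl.col("y"))
)
```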
I came across a question on Stack Overflow that provided a great example of the benefits of working in an index-aligned way.
## The Question
> I have found how to count the values of a specified column, but I can’t find a way to do the same thing for the entire dataframe. I could count the values of each column and then combine them afterward, but it is a bit clunky and I was looking for a more elegant solution.
While the question did not include a minimal reproducible example, here is one I created based on the original post.
input:

```text
col_1  col_2  col_3
"a"    "b"    "a"
"b"    "c"    "a"
"c"    "d"    "e"
"a"    "b"    "f"
```
expected output:

```text
value  count
"a"    4
"b"    3
"c"    2
"d"    1
"e"    1
"f"    1
```
Essentially, we want to take the value counts across each column and combine those results.
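As a quick sanity check on that expected output, plain Python’s `collections.Counter` (my own sketch, independent of either DataFrame library) gets us there:

```python
from collections import Counter

# Tally every column's values into one shared counter.
columns = {
    "col_1": [*"abca"],
    "col_2": [*"bcdb"],
    "col_3": [*"aaef"],
}
counts = Counter()
for values in columns.values():
    counts.update(values)

print(counts.most_common())
# [('a', 4), ('b', 3), ('c', 2), ('d', 1), ('e', 1), ('f', 1)]
```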
## Polars - Tidy Data
The simplest approach in Polars requires reshaping the data. An unpivot operation (historically called melt) takes our wide-format data and transforms it so that we have a single observation per row that we can operate on.
```python
import polars as pl
from polars import col

pl_df = pl.DataFrame({
    "col_1": [*"abca"],
    "col_2": [*"bcdb"],
    "col_3": [*"aaef"],
})

# Reshape to long format, then count values in the single "value" column.
pl_df.unpivot()["value"].value_counts().sort("count", descending=True)
```
| value | count |
| --- | --- |
| str | u32 |
| "a" | 4 |
| "b" | 3 |
| "c" | 2 |
| "f" | 1 |
| "e" | 1 |
| "d" | 1 |
While this approach is convenient, reshaping the data is often considered an expensive operation. So, I wanted to come up with a way to avoid it if possible.
## Polars - Untidy Data
For the untidy approach, we need to perform a `.value_counts()` on each of the columns individually, align each of those results, and sum across their counts. While this feels awkward, the explicit work is what lets us avoid paying the cost of reshaping the data. Without the implicit index alignment we have in pandas, we need to be explicit about our alignment, hence the call to `polars.concat` with `how='align'` on each of the parts: it outer-joins the parts on the shared `value` column, and any value missing from a column becomes null, which `sum_horizontal` ignores by default.
```python
parts = []
for c in pl_df.columns:
    # Per-column value counts, unnested into (value, count) and renamed
    # so each part carries its counts under the original column name.
    parts.append(
        pl_df.select(col(c).value_counts())
        .unnest(c)
        .rename({c: "value", "count": c})
    )

(
    pl.concat(parts, how="align")  # outer-join the parts on "value"
    .select("value", count=pl.sum_horizontal(pl_df.columns))
)
```
| value | count |
| --- | --- |
| str | u32 |
| "a" | 4 |
| "b" | 3 |
| "c" | 2 |
| "d" | 1 |
| "e" | 1 |
| "f" | 1 |
## pandas
With pandas, we do not actually need to perform an explicit concatenation. Instead, we can rely on the index-aligned result of `DataFrame.apply(...)` (which applies a function to each column; not to be confused with `DataFrame.apply(..., axis=1)`, which applies the function to each row).
Since the results are aligned on their shared index, we can easily sum horizontally across them for the final result. Note that values missing from a column become NaN after alignment, which is why the summed counts come back as floats.
```python
# Per-column value counts; pandas aligns the results on their shared index.
pd_df = pl_df.to_pandas()
pd_df.apply(lambda s: s.value_counts()).sum(axis=1)
```
```text
a    4.0
b    3.0
c    2.0
d    1.0
e    1.0
f    1.0
dtype: float64
```
Wow. That implicit index alignment really goes a long way toward writing concise code!
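For symmetry, pandas can also take the tidy route from the Polars section: `stack` is its reshape-to-long analogue of `unpivot`. A sketch of that alternative (not the approach from the post):

```python
# Continuing with pd_df from above: reshape all columns into one long
# Series, then count. No alignment means no NaNs, so counts stay integers.
pd_df.stack().value_counts()
```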
## Wrap-Up
There you have it! While this is one of the cases where the `pandas.Index` becomes very helpful, I know many of us fight with it. Want to know more about the index? Join us on Discord!