Case When: A Welcome Addition to pandas Conveniences#
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share
about a relatively new pandas.Series
method: case_when.
This function exists to conveniently make replacements across multiple conditions, but
instead of describing what it does, I’d rather show you. Let’s jump into our premise.
Premise#
Suppose you are a teacher grading papers with six students. They all have different grades, which are the following:
from pandas import DataFrame
df = DataFrame({
'student_id': [1, 2, 3, 4, 5, 6],
'score' : [90, 98, 72, 45, 85, 60],
})
df
student_id | score | |
---|---|---|
0 | 1 | 90 |
1 | 2 | 98 |
2 | 3 | 72 |
3 | 4 | 45 |
4 | 5 | 85 |
5 | 6 | 60 |
After evaluating their scores, you need to classify their grades into mutually exclusive groups based on the following schema:
90 <= score <= 100 → 'A'
80 <= score < 90 → 'B'
70 <= score <= 80 → 'C'
60 <= score <= 70 → 'D'
score <= 60 → 'F'
There are MANY ways to solve this problem, so I wanted to take a moment to outline all of the ways I could think of and discuss why I would or would not use them in my own code.
NumPy Where#
First, I’ll show how we can chain calls to numpy.where
in order to grade our students’
scores. For those who are unfamiliar, numpy.where
can assemble a new array from
a mask and two inputs (either scalars or arrays). For this example, we can
conveniently use where
to indicate whether students passed or failed the class:
from numpy import where
# mask value when mask is True, value when mask is False
pf = where(df['score'] >= 60, 'pass', 'fail')
df.assign(pass_fail=pf)
student_id | score | pass_fail | |
---|---|---|---|
0 | 1 | 90 | pass |
1 | 2 | 98 | pass |
2 | 3 | 72 | pass |
3 | 4 | 45 | fail |
4 | 5 | 85 | pass |
5 | 6 | 60 | pass |
While this is great for binary conditionals, it turns into a hard-to-read mess when nesting conditions. Here, we have to read this code inside out (in the same order that it is evaluated):
Scores less than 60 are given an F, otherwise they receive a NaN
Scores between 60-70 are given a D, otherwise they receive the output from step 1.
Scores between 70-80 are given a C, otherwise they receive the output from step 2.
… (and the pattern continues)
from numpy import where, nan
grades = where(
df['score'].between(90, 100), 'A',
where(df['score'].between(80, 90, inclusive='left'), 'B',
where(df['score'].between(70, 80, inclusive='left'), 'C',
where(df['score'].between(60, 70, inclusive='left'), 'D',
where(df['score'] < 60, 'F', nan)
)
)
)
)
df.assign(grades=grades)
student_id | score | grades | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
While this works computationally, code is read by other people. The nested levels
of indentation make it hard for our brains to parse. Instead, we might be interested
in reaching for a more appropriate NumPy function, which is designed to handle
multiple conditions like numpy.select
NumPy Select#
numpy.select
works by formulating a list of Boolean arrays which each correspond
to some condition. In our case, we form arrays based on the ranges for each grade.
We then pass in a secondary array whose length is equal to the number of cases/conditions
passed and whose values will be filled in when their corresponding condition is True
.
So, in the code below, when the first condition in caselist
is True
, we populate
the array with the first grade in grades
.
from numpy import select
caselist = [
df['score'].between(90, 100),
df['score'].between(80, 90, inclusive='left') ,
df['score'].between(70, 80, inclusive='left') ,
df['score'].between(60, 70, inclusive='left') ,
df['score'] < 60,
]
grades = [*'ABCDF']
df.assign(grade=select(caselist, grades))
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
This solution is much more readable because it allows us to structure each of our conditions nicely. However, the function expects separate conditions and replacement values, which can be a source of small errors when the number of grades and cases do not match up. Of course, we could use a bit more Python syntax to alleviate this pain by doing something like:
caselist = [ # structure conditions paired with replacement
(df['score'].between(90, 100), 'A'),
(df['score'].between(80, 90, inclusive='left') , 'B'),
(df['score'].between(70, 80, inclusive='left') , 'C'),
(df['score'].between(60, 70, inclusive='left') , 'D'),
(df['score'] < 60, 'F'),
]
cases, grades = zip(*caselist) # separate the conditions and grades
df.assign(grade=select(cases, grades))
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
pandas .loc Repeated#
Another way to solve this problem is to use repeated .loc
operations to populate
the array, one condition at a time. This code encourages us to keep our conditions
bound to their replacement values, but it does suffer from some clutter due to the
repetition of s.loc[…] =
pattern.
from pandas import Series, NA
s = Series(NA, index=df.index, dtype=str)
s.loc[df['score'].between(90, 100 )] = 'A'
s.loc[df['score'].between(80, 90, inclusive='left')] = 'B'
s.loc[df['score'].between(70, 80, inclusive='left')] = 'C'
s.loc[df['score'].between(60, 70, inclusive='left')] = 'D'
s.loc[df['score'] < 60 ] = 'F'
df.assign(grade=s)
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
Just like our numpy.select
solution, we can use Python syntax to
make our result more bit more readable by first coupling our conditions and replacements,
then using a for
loop to drive the assignment:
caselist = [
(df['score'].between(90, 100), 'A'),
(df['score'].between(80, 90, inclusive='left') , 'B'),
(df['score'].between(70, 80, inclusive='left') , 'C'),
(df['score'].between(60, 70, inclusive='left') , 'D'),
(df['score'] < 60, 'F'),
]
s = Series(NA, index=df.index)
for mask, gr in caselist:
s.loc[mask] = gr
df.assign(grade=s)
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
pandas case_when#
The function of the hour, pandas.Series.case_when
was added in pandas version 2.2.0 and provides a convenient interface for this
exact type of pattern. I appreciate this convenience method because it encourages
the reasonable thought of how one should structure their conditions and replacement
values in a human-readable way, while also being computationally performant.
caselist = [
(df['score'].between(90, 100), 'A'),
(df['score'].between(80, 90, inclusive='left') , 'B'),
(df['score'].between(70, 80, inclusive='left') , 'C'),
(df['score'].between(60, 70, inclusive='left') , 'D'),
(df['score'] < 60, 'F'),
]
s = Series(NA, index=df.index)
df.assign(grade=s.case_when(caselist))
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
Wrap-Up#
That’s all for this week! While pandas.Series.case_when
has been out for a little
while, I wanted to help spread the word about the active development and new conveniences
that our most veteran Python DataFrame library has to offer.
What do you think about pandas.Series.case_when
? How do you think you’ll implement it in your code? Let me know on the DUTC Discord server.
Talk to you all next week!