Case When: A Welcome Addition to pandas Conveniences
Hello, everyone! Welcome back to Cameron's Corner. This week, I want to share about a relatively new pandas.Series method: case_when. This function exists to conveniently make replacements across multiple conditions, but instead of describing what it does, I'd rather show you. Let's jump into our premise.
Premise
Suppose you are a teacher grading papers with six students. They all have different grades, which are the following:
from pandas import DataFrame
df = DataFrame({
'student_id': [1, 2, 3, 4, 5, 6],
'score' : [90, 98, 72, 45, 85, 60],
})
df
student_id | score | |
---|---|---|
0 | 1 | 90 |
1 | 2 | 98 |
2 | 3 | 72 |
3 | 4 | 45 |
4 | 5 | 85 |
5 | 6 | 60 |
After evaluating their scores, you need to classify their grades into mutually exclusive groups based on the following schema:
90 <= score <= 100 → 'A'
80 <= score < 90 → 'B'
70 <= score <= 80 → 'C'
60 <= score <= 70 → 'D'
score <= 60 → 'F'
There are MANY ways to solve this problem, so I wanted to take a moment to outline all of the ways I could think of and discuss why I would or would not use them in my own code.
NumPy Where
First, I'll show how we can chain calls to numpy.where in order to grade our students' scores. For those who are unfamiliar, numpy.where can assemble a new array from a mask and two inputs (either scalars or arrays). For this example, we can conveniently use where to indicate whether students passed or failed the class:
from numpy import where
# mask value when mask is True, value when mask is False
pf = where(df['score'] >= 60, 'pass', 'fail')
df.assign(pass_fail=pf)
student_id | score | pass_fail | |
---|---|---|---|
0 | 1 | 90 | pass |
1 | 2 | 98 | pass |
2 | 3 | 72 | pass |
3 | 4 | 45 | fail |
4 | 5 | 85 | pass |
5 | 6 | 60 | pass |
While this is great for binary conditionals, it turns into a hard-to-read mess when nesting conditions. Here, we have to read this code inside out (in the same order that it is evaluated):
Scores less than 60 are given an F, otherwise they receive a NaN
Scores between 60-70 are given a D, otherwise they receive the output from step 1.
Scores between 70-80 are given a C, otherwise they receive the output from step 2.
... (and the pattern continues)
from numpy import where
from pandas import NA
grades = where(
df['score'].between(90, 100), 'A',
where(df['score'].between(80, 90, inclusive='left'), 'B',
where(df['score'].between(70, 80, inclusive='left'), 'C',
where(df['score'].between(60, 70, inclusive='left'), 'D',
where(df['score'] < 60, 'F', NA)
)
)
)
)
df.assign(grades=grades)
student_id | score | grades | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
While this works computationally, code is read by other people. The nested levels of indentation make it hard for our brains to parse. Instead, we might be interested in reaching for a more appropriate NumPy function, which is designed to handle multiple conditions like numpy.select
NumPy Select
numpy.select works by formulating a list of Boolean arrays which each correspond to some condition. In our case, we form arrays based on the ranges for each grade. We then pass in a secondary array whose length is equal to the number of cases/conditions passed and whose values will be filled in when their corresponding condition is True. So, in the code below, when the first condition in caselist is True, we populate the array with the first grade in grades.
from numpy import select
caselist = [
df['score'].between(90, 100),
df['score'].between(80, 90, inclusive='left') ,
df['score'].between(70, 80, inclusive='left') ,
df['score'].between(60, 70, inclusive='left') ,
df['score'] < 60,
]
grades = [*'ABCDF']
df.assign(grade=select(caselist, grades, default=None))
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
This solution is much more readable because it allows us to structure each of our conditions nicely. However, the function expects separate conditions and replacement values, which can be a source of small errors when the number of grades and cases do not match up. Of course, we could use a bit more Python syntax to alleviate this pain by doing something like:
caselist = [ # structure conditions paired with replacement
(df['score'].between(90, 100), 'A'),
(df['score'].between(80, 90, inclusive='left') , 'B'),
(df['score'].between(70, 80, inclusive='left') , 'C'),
(df['score'].between(60, 70, inclusive='left') , 'D'),
(df['score'] < 60, 'F'),
]
cases, grades = zip(*caselist) # separate the conditions and grades
df.assign(grade=select(cases, grades, default=None))
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
pandas .loc Repeated
Another way to solve this problem is to use repeated .loc operations to populate the array, one condition at a time. This code encourages us to keep our conditions bound to their replacement values, but it does suffer from some clutter due to the repetition of s.loc[…] = pattern.
from pandas import Series, NA
s = Series(NA, index=df.index, dtype=str)
s.loc[df['score'].between(90, 100 )] = 'A'
s.loc[df['score'].between(80, 90, inclusive='left')] = 'B'
s.loc[df['score'].between(70, 80, inclusive='left')] = 'C'
s.loc[df['score'].between(60, 70, inclusive='left')] = 'D'
s.loc[df['score'] < 60 ] = 'F'
df.assign(grade=s)
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
Just like our numpy.select solution, we can use Python syntax to make our result more bit more readable by first coupling our conditions and replacements, then using a for loop to drive the assignment:
caselist = [
(df['score'].between(90, 100), 'A'),
(df['score'].between(80, 90, inclusive='left') , 'B'),
(df['score'].between(70, 80, inclusive='left') , 'C'),
(df['score'].between(60, 70, inclusive='left') , 'D'),
(df['score'] < 60, 'F'),
]
s = Series(NA, index=df.index)
for mask, gr in caselist:
s.loc[mask] = gr
df.assign(grade=s)
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
pandas case_when
The function of the hour, pandas.Series.case_when was added in pandas version 2.2.0 and provides a convenient interface for this exact type of pattern. I appreciate this convenience method because it encourages the reasonable thought of how one should structure their conditions and replacement values in a human-readable way, while also being computationally performant.
caselist = [
(df['score'].between(90, 100), 'A'),
(df['score'].between(80, 90, inclusive='left') , 'B'),
(df['score'].between(70, 80, inclusive='left') , 'C'),
(df['score'].between(60, 70, inclusive='left') , 'D'),
(df['score'] < 60, 'F'),
]
s = Series(NA, index=df.index)
df.assign(grade=s.case_when(caselist))
student_id | score | grade | |
---|---|---|---|
0 | 1 | 90 | A |
1 | 2 | 98 | A |
2 | 3 | 72 | C |
3 | 4 | 45 | F |
4 | 5 | 85 | B |
5 | 6 | 60 | D |
Wrap-Up
That's all for this week! While pandas.Series.case_when has been out for a little while, I wanted to help spread the word about the active development and new conveniences that our most veteran Python DataFrame library has to offer.
What do you think about pandas.Series.case_when? How do you think you'll implement it in your code? Let me know on the DUTC Discord server.
Talk to you all next week!