pandas: map, pipe, apply explained
Whether you’re new to pandas or an experienced user, you’ve probably come across the sage advice: don’t use .map or .apply. But as with many adages (especially in programming), we need to be aware of the context of this advice so that we don’t blindly follow something we found on the internet.
Rigid Restricted Computation Domains
If we’re using a tool like NumPy or pandas, we want our code to run fast. However, to accomplish this, we typically have to sacrifice flexibility. This is the core idea of working within a Restricted Computation Domain and often results in code that looks slightly disjointed from the broader language that the domain is implemented within.
For example, in pure Python, if we want to increment every value in a list, we would probably use a list comprehension.
# just some helper(s) for later
from traceback import print_exc
from contextlib import contextmanager
from sys import stderr
@contextmanager
def short_traceback(limit=2):
    try:
        yield
    except Exception:
        print_exc(limit=limit, file=stderr, chain=False)
xs = [1, 2, 3]
[x+1 for x in xs]
[2, 3, 4]
from pandas import Series
s = Series([1, 2, 3])
s + 1 # no explicit for-loop
0 2
1 3
2 4
dtype: int64
We change our code so that as much of the computation as possible happens within the pandas domain, and for large numeric arrays we would expect the pandas implementation to be much faster than the Python list comprehension.
py_xs = [*range(100_000)]
pd_xs = Series(py_xs)
%timeit -n1 -r1 [x+1 for x in py_xs]
%timeit -n1 -r1 pd_xs + 1
5.75 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
433 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Most pandas operations are fast when they align with built-in, column-level behavior (e.g. numeric math, string methods, datetime manipulation, etc.). But what happens when we have data that’s just strings, and those strings contain a structure that pandas doesn’t understand out of the box? Suddenly, we’re back in Python-land and need tools that bring Python's flexibility into pandas.
Say I want to calculate the number of decades each individual has been alive according to their "age" in the data below. Due to how our data is stored, we won't be able to process everything in "pure pandas".
from io import StringIO
from pandas import read_csv
data = StringIO('''
id,dump
10,"age=42; height=180; weight=75"
32,"age=29; height=165; weight=60"
94,"age=55; weight=82; height=172"
62,"weight=70; age=33; height=178"
28,"height=160; weight=58; age=24"
''')
df = read_csv(data)
df.head()
| | id | dump |
|---|---|---|
| 0 | 10 | age=42; height=180; weight=75 |
| 1 | 32 | age=29; height=165; weight=60 |
| 2 | 94 | age=55; weight=82; height=172 |
| 3 | 62 | weight=70; age=33; height=178 |
| 4 | 28 | height=160; weight=58; age=24 |
with short_traceback():
    df['dump'] // 10
The above syntax does not accomplish the goal, so we need to process this data in Python: parse it and store the result back in the pandas domain.
from pandas import DataFrame
def parse(entry):
    return dict(item.strip().split('=') for item in entry.split(';'))
DataFrame.from_records([parse(entry) for entry in df['dump']])
| | age | height | weight |
|---|---|---|---|
| 0 | 42 | 180 | 75 |
| 1 | 29 | 165 | 60 |
| 2 | 55 | 172 | 82 |
| 3 | 33 | 178 | 70 |
| 4 | 24 | 160 | 58 |
However, the use of a list comprehension obscures the goal of the code by adding technical detail (e.g. "iterate over the values in df['dump']"). Instead, we typically want our pandas code to represent only the high-level logic, and when we are already dropping down to the Python level is when .map and .apply become useful.
df['dump'].map(parse).pipe(DataFrame.from_records)
| | age | height | weight |
|---|---|---|---|
| 0 | 42 | 180 | 75 |
| 1 | 29 | 165 | 60 |
| 2 | 55 | 172 | 82 |
| 3 | 33 | 178 | 70 |
| 4 | 24 | 160 | 58 |
The above pandas example is no more performant than the previous pandas + list comprehension example. The primary difference is that our data flow now reads entirely left to right and better highlights the goal of the code rather than its implementation details.
What is .map?
Series.map
If you followed the code above closely, then you can probably guess that pandas.Series.map is all about calling a user-defined function (UDF) on each value of a given Series. And you would be spot on!
from pandas import Series
s = Series([*'abcd'])
s.map(str.upper) # s.str.upper() is the preferred way to do this
0 A
1 B
2 C
3 D
dtype: object
However, pandas also has some added features that you can take advantage of if you need to reach for Series.map. The most useful of which is probably the na_action argument, which provides control over whether pandas.NA or numpy.nan values are passed into the UDF.
from pandas import Series
s = Series(['a', None, 'c', 'd'], dtype='string')
with short_traceback():
    s.map(str.upper)  # pandas.NA does not support the str.upper function
s.map(str.upper, na_action='ignore')  # NAType objects are never passed to the UDF
0 A
1 <NA>
2 C
3 D
dtype: object
In addition to na_action, one can also pass other types of objects instead of functions to .map! If you pass a dictionary, it will align the keys of that dictionary against the values in the Series, then return the values that correspond to those dictionary keys. Alternatively, if you already have a pandas.Series, you can use .map as a form of join across the two Series objects.
from pandas import Series, DataFrame
s = Series([*'abcd'])
DataFrame({  # dataframe just to organize the output
    'dict': s.map({'a': 0, 'b': 1, 'c': 2}),
    'series': s.map(Series([2, 3, 4], index=[*'abc'])),
})
| | dict | series |
|---|---|---|
| 0 | 0.0 | 2.0 |
| 1 | 1.0 | 3.0 |
| 2 | 2.0 | 4.0 |
| 3 | NaN | NaN |
I will mention that these two uses (passing a dictionary or a Series) are a bit more esoteric. However, they can streamline some otherwise lengthy code that performs conceptually simple operations:
mapper = Series({'a': 0, 'b': 1, 'c': 2}, name='right')
(
    s.to_frame('left')
    .merge(mapper, left_on='left', right_index=True)
    .set_index('left')['right']
    .rename().rename_axis(None)
)
a 0
b 1
c 2
dtype: int64
So the primary purpose of Series.map is to call a function on each value of the Series. But what about pandas.DataFrame.map?
DataFrame.map
DataFrame.map is a relatively new addition to the pandas API (added in version 2.1.0, before which it was called DataFrame.applymap). The idea here is that we have a DataFrame and we want to call a function on each of the values within that DataFrame.
In the below example, we call str.upper on each scalar value independently and construct a new DataFrame from those results.
from pandas import DataFrame
df = DataFrame({'x': [*'abc'], 'y': [*'def']})
df.map(str.upper)
| | x | y |
|---|---|---|
| 0 | A | D |
| 1 | B | E |
| 2 | C | F |
It also supports the na_action argument, which has the same behavior we saw earlier. Pretty simple, right?
By investigating .map at both the Series and DataFrame levels, we understand the pattern: .map is all about calling a function on the smallest unit of a given container. In pandas, the "smallest unit" is always going to be the immediate values that exist within a Series, which gives us a container hierarchy to reason about:
DataFrames are containers that house zero or more Series.
Series are containers that house zero or more values.
Values themselves can be of any type, and we do not discern nested containers at this level.
DataFrames → Series → values.
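This hierarchy is easy to confirm interactively; a minimal sketch:

```python
from pandas import DataFrame, Series

df = DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

col = df['x']      # indexing a DataFrame hands back a Series…
val = col.iloc[0]  # …and indexing a Series hands back a scalar value

print(type(df).__name__, type(col).__name__, type(val).__name__)
```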
So, if you have a function that needs to work on values within pandas, then .map is going to be your friend. Of course, if you want your code to be as performant as possible, then you should try to refactor your code to avoid user-defined functions and .map entirely.
What is .pipe?
Both Series and DataFrame have a .pipe method, which thankfully does not have the same degree of input flexibility that .map has. In fact, both Series.pipe and DataFrame.pipe do one thing: they call a function, passing in the object that .pipe was invoked on.
This directly means that the following are exactly equivalent:
from pandas import Series, DataFrame
def add_one(x):
    # x is a Series in this example
    return x + 1
s = Series([0, 1, 2])
DataFrame({
    'direct call': add_one(s),
    'pipe': s.pipe(add_one),
})
| | direct call | pipe |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 2 |
| 2 | 3 | 3 |
Similarly, its DataFrame counterpart can also be used to keep the data flow reading from left to right. For example, if we have a user-defined function that takes in a DataFrame and outputs a new DataFrame, we can either call this function directly on its input or use DataFrame.pipe.
from pandas import DataFrame, concat
def clean(df):
    """Don't worry, there is a more coherent way to write this using `.apply`"""
    string_cols = df.select_dtypes('object').columns
    remaining_cols = df.columns.difference(string_cols)
    clean_string_cols = (df[s].str.strip() for s in string_cols)
    remaining_df = df[remaining_cols]
    return (
        concat([*clean_string_cols, remaining_df], axis=1)  # recreate DataFrame from processed parts
        .reindex(columns=df.columns)                        # use original column ordering
        .rename(columns=str.upper)                          # uppercase column names
    )
df = DataFrame({
    'a': [1, 2, 3],
    'b': [' hello world', 'extra spaces ', '...'],
})
display(
    df.pipe(clean).map(repr),
    df.pipe(clean).equals(clean(df)),  # produces the same output!
)
| | A | B |
|---|---|---|
| 0 | 1 | 'hello world' |
| 1 | 2 | 'extra spaces' |
| 2 | 3 | '...' |
True
And that’s the entire story of .pipe. It is probably the most straightforward method that we will cover today! Typically, I use .pipe to call functions on the current state of a given DataFrame, allowing me to maintain an ongoing method chain without needing to break it. Additionally, I'll use .pipe as a crutch for other pandas methods that do not support method-chained input, which you can read more about in a previous blog post.
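One more detail that makes .pipe pleasant for method chains: any extra positional or keyword arguments are forwarded to the function. A minimal sketch, where clip_to is a hypothetical helper:

```python
from pandas import Series

def clip_to(s, lower, upper):
    # hypothetical helper: clamp all values into the [lower, upper] range
    return s.clip(lower=lower, upper=upper)

s = Series([-5, 0, 5, 10])

# extra arguments after the function are forwarded to it by .pipe
result = s.pipe(clip_to, 0, upper=8)
print(result.tolist())  # [0, 0, 5, 8]
```

.pipe also accepts a (callable, data_keyword) tuple for functions that expect the Series/DataFrame under a specific keyword argument rather than as the first positional one.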
What is .apply?
Just like .map, both DataFrame and Series have a .apply method.
In the same way that .map allows us to operate on the smallest unit of a given pandas object, .apply allows us to operate on the unit that exists one level beneath the current object. Referring back to our container hierarchy:
DataFrames → Series → values.
The above implies that DataFrame.apply allows us to operate on individual Series objects. And Series.apply allows us to operate on the values within a Series.
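One way to see this difference directly is to inspect the type of whatever the UDF receives; a minimal sketch:

```python
from pandas import DataFrame

df = DataFrame({'x': [1, 2], 'y': [3, 4]})

# DataFrame.apply feeds whole Series into the UDF (one per column by default)
print(df.apply(lambda s: type(s).__name__).tolist())  # ['Series', 'Series']

# Series.apply feeds the individual values into the UDF
print(df['x'].apply(lambda v: type(v).__name__).tolist())
```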
Series.apply
But wait a minute: we already have a method to call a function on the values in a Series (.map), so why is there another way to accomplish this? Honestly, I don't think this method should exist, as it adds confusion and introduces some potential footguns. Let's first revisit what we’ve seen in order to choose the most appropriate function-calling approach. Looking at the code below, we have a function that evaluates x + 1; since Python is a duck-typed language, you'll notice that we can call this function on a single value (e.g., 100) or on an entire Series.
If we invoke this function via .map, then we make one function call per value in the Series. However, if we invoke it via .pipe, we call the function only once, and pandas takes care of the hard part of the computation for us. We can see the massive difference in performance that the repeated function calls (and Python-level iteration) have on our code.
from pandas import Series
def add_one(x):
    return x + 1
s = Series([*range(100_000)])
%timeit -n1 -r1 s.map(add_one)
%timeit -n1 -r1 s.pipe(add_one)
30.3 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
336 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
So if Series.apply is most similar to Series.map, then we should see similar performance.
%timeit -n1 -r1 s.apply(add_one)
%timeit -n1 -r1 s.apply(add_one, by_row=False)
29.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
405 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
But wait, what’s that magic argument by_row=False? Does it speed all computations up? Nope! Here we can see the redundancy of Series.apply: all that the by_row argument does is toggle between calling the function once per value (i.e., .map) and passing the entire Series into the function (i.e., .pipe).
In fact, by default, Series.apply attempts to detect whether your passed function can be called on the entire Series.
from numpy import add as np_add
s = Series([*range(100_000)])
%timeit -n1 -r1 s.apply(np_add, args=(1,))
%timeit -n1 -r1 s.apply(np_add, args=(1,), by_row=False)
834 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
684 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
But I would not rely on pandas being able to guess the nature of the passed function. If you want to write more idiomatic pandas code, I would avoid Series.apply entirely; you have more specific methods to reach for in the scenarios at hand.
DataFrame.apply
Now this is a much more useful and interesting method, because this is how we can operate on the individual Series objects that exist within a DataFrame. This method gets very bad publicity: the internet decided to publicize that .apply is slow, primarily because of the Series.apply example above and the DataFrame.apply(…, axis=1) pattern that we will look at shortly.
from pandas import DataFrame
df = DataFrame({
    'a': ['some strings', ' MORE STRINGS '],
    'b': [' hello world', 'extra spaces '],
})
df.map(repr) # map(repr) to show the white space at the beginning/end of strings
| | a | b |
|---|---|---|
| 0 | 'some strings' | ' hello world' |
| 1 | ' MORE STRINGS ' | 'extra spaces ' |
If I want to convert all of these values to lowercase and strip whitespace, you may think that I should reach for .map since I need to operate on every value in the DataFrame. However, I should always take advantage of the vectorized operations that pandas has access to. The recently added PyArrow backend has some incredibly fast string implementations, which we can only access via the pandas.Series interface.
The following examples produce the same output:
df.map(lambda v: v.strip().lower()) # call function on each value
| | a | b |
|---|---|---|
| 0 | some strings | hello world |
| 1 | more strings | extra spaces |
df.apply(lambda s: s.str.strip().str.lower()) # call function on each series
| | a | b |
|---|---|---|
| 0 | some strings | hello world |
| 1 | more strings | extra spaces |
But if we increase the dataset size and time the computation, you'll start to see some massive differences.
from numpy.random import default_rng
from string import ascii_uppercase
rng = default_rng(0)
strings = (
    rng.choice([*ascii_uppercase], size=(3_000_000, length := 10), replace=True)
    .view(f'<U{length}')
    .ravel()
)
large_df = DataFrame({
    's1': Series(strings, dtype='string[pyarrow]'),
    's2': Series(strings, dtype='string[pyarrow]').sample(frac=1),
})
large_df.tail()
| | s1 | s2 |
|---|---|---|
| 2999995 | UUPNPBZKEP | UUPNPBZKEP |
| 2999996 | YIVFZJWOJC | YIVFZJWOJC |
| 2999997 | UTURFYOZKV | UTURFYOZKV |
| 2999998 | KYHWATEASI | KYHWATEASI |
| 2999999 | VJXHXYLOFS | VJXHXYLOFS |
%timeit -n1 -r1 large_df.map(lambda v: v.strip().lower()) # call function on each value
%timeit -n1 -r1 large_df.apply(lambda s: s.str.strip().str.lower()) # call function on each series
1.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
222 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
That said, DataFrame.apply also holds one of the largest footguns that new users might encounter: you think you want to iterate along the rows of the DataFrame to apply some custom logic. This will almost ALWAYS be the computationally slowest thing you can do.
from pandas import DataFrame
df = DataFrame({
    'a': ['some strings', ' MORE STRINGS '],
    'b': [' hello world', 'extra spaces '],
})
df.apply(lambda row: row['a'].lower().strip(), axis=1)
0 some strings
1 more strings
dtype: object
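When the row-wise logic only ever touches one column at a time (as above), the fix is to rewrite it against whole columns. The following sketch produces the same result with one vectorized .str call per operation instead of one Python call per row:

```python
from pandas import DataFrame

df = DataFrame({
    'a': ['some strings', ' MORE STRINGS '],
    'b': [' hello world', 'extra spaces '],
})

# equivalent to df.apply(lambda row: row['a'].lower().strip(), axis=1),
# but operating on the column as a whole
result = df['a'].str.lower().str.strip()
print(result.tolist())  # ['some strings', 'more strings']
```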
That said, what is slow for millions of rows is relatively fast for hundreds of rows, so if you can't refactor your problem to avoid .apply(…, axis=1) and your code runs with a satisfactory speed, then you can move on to a more important problem.
Wrap-Up
Whew! That has been a lot of pandas code.
.map, .pipe, and .apply allow us to call user-defined functions on our pandas objects while targeting different levels of the container hierarchy. Don't forget that the core pandas objects are containers:
DataFrame → Series → Values
And conceptually:
.map functions are ALWAYS fed the Values as input
.pipe functions are fed input of the same level of the hierarchy
specifically, these functions operate on the object that owns .pipe.
.apply functions are fed input that is one level lower in the hierarchy
DataFrame.apply receives Series as its inputs
Series.apply receives Values as its inputs (just use .map)
What are your thoughts? Let me know on the DUTC Discord server!