pandas: map, pipe, apply explained

Whether you’re new to pandas or a seasoned user, you’ve probably come across the sage advice: don’t use .map or .apply. But as with many adages (especially in programming), we need to be aware of the context of this advice so we don’t blindly follow something we found on the internet.

Rigid Restricted Computation Domains

If we’re using a tool like NumPy or pandas, we want our code to run fast. However, to accomplish this, we typically have to sacrifice flexibility. This is the core idea of working within a Restricted Computation Domain and often results in code that looks slightly disjointed from the broader language that the domain is implemented within.

For example, in pure Python, if we want to increment every value in a list, we would probably use a list comprehension.

# just some helper(s) for later
from traceback import print_exc
from contextlib import contextmanager
from sys import stderr

@contextmanager
def short_traceback(limit=2):
    try:
        yield
    except Exception:
        print_exc(limit=limit, file=stderr, chain=False)
xs = [1, 2, 3]
[x+1 for x in xs]
[2, 3, 4]
from pandas import Series
s = Series([1, 2, 3])
s + 1 # no explicit for-loop
0    2
1    3
2    4
dtype: int64

We change our code so that we keep as much of the computation within the pandas domain as possible, so for large numeric arrays we would expect the pandas implementation to be much faster than the Python list comprehension.

py_xs = [*range(100_000)]
pd_xs = Series(py_xs)

%timeit -n1 -r1 [x+1 for x in py_xs]
%timeit -n1 -r1 pd_xs + 1
5.75 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
433 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Most pandas operations are fast when they align with built-in, column-level behavior (e.g. numeric math, string methods, datetime manipulation, etc.). But what happens when we have data that’s just strings, and those strings contain a structure that pandas doesn’t understand out of the box? Suddenly, we’re back in Python-land and need tools that bring Python's flexibility into pandas.

Say I want to calculate the number of decades each individual has been alive according to their "age" in the data below. Due to how our data is stored, we won't be able to process everything in "pure pandas".

from io import StringIO
from pandas import read_csv

data = StringIO('''
    id,dump
    10,"age=42; height=180; weight=75"
    32,"age=29; height=165; weight=60"
    94,"age=55; weight=82; height=172"
    62,"weight=70; age=33; height=178"
    28,"height=160; weight=58; age=24"
''')

df = read_csv(data)
df.head()
id dump
0 10 age=42; height=180; weight=75
1 32 age=29; height=165; weight=60
2 94 age=55; weight=82; height=172
3 62 weight=70; age=33; height=178
4 28 height=160; weight=58; age=24
with short_traceback():
    df['dump'] // 10

The above syntax fails because pandas has no built-in understanding of the structure packed into these strings. We need to drop into Python to parse the data, then store the result back in the pandas domain.

from pandas import DataFrame

def parse(entry):
    return dict(item.strip().split('=') for item in entry.split(';'))

DataFrame.from_records([parse(entry) for entry in df['dump']])
age height weight
0 42 180 75
1 29 165 60
2 55 172 82
3 33 178 70
4 24 160 58

However, the use of a list comprehension obscures the goal of the code by adding technical detail (e.g. "iterate over the values in df['dump']"). Instead, we typically want our pandas code to represent only the high-level logic. When we do need to drop down to the Python level, this is where .map and .apply become useful.

df['dump'].map(parse).pipe(DataFrame.from_records)
age height weight
0 42 180 75
1 29 165 60
2 55 172 82
3 33 178 70
4 24 160 58

The above pandas example is no more performant than the previous pandas + list comprehension example. The primary difference is that our data flow now reads entirely left to right and better highlights the goal of the code rather than the implementation details.

What is .map?

Series.map

If you followed the code above closely, then you can probably guess that pandas.Series.map is all about calling a user-defined function (UDF) on each value of a given Series. And you would be spot on!

from pandas import Series

s = Series([*'abcd'])
s.map(str.upper) # s.str.upper() is the preferred way to do this
0    A
1    B
2    C
3    D
dtype: object

However, pandas also has some added features that you can take advantage of if you do need to reach for Series.map. The most useful of these is probably the na_action argument, which provides control over whether pandas.NA or numpy.nan values are passed into the UDF.

from pandas import Series

s = Series(['a', None, 'c', 'd'], dtype='string')
with short_traceback():
    s.map(str.upper) # pandas.NA does not support the str.upper function
s.map(str.upper, na_action='ignore') # NAType objects are never passed to the UDF
0       A
1    <NA>
2       C
3       D
dtype: object

In addition to na_action, you can also pass objects other than functions to .map! If you pass a dictionary, pandas will align the dictionary's keys against the values in the Series and return the corresponding dictionary values. Alternatively, if you already have a pandas.Series, you can use .map as a form of join across the two Series objects.

from pandas import Series, DataFrame

s = Series([*'abcd'])

DataFrame({ # dataframe just to organize the output
    'dict': s.map({'a': 0, 'b': 1, 'c': 2}),
    'series': s.map(Series([2, 3, 4], index=[*'abc'])),
})
dict series
0 0.0 2.0
1 1.0 3.0
2 2.0 4.0
3 NaN NaN

I will mention that these two uses (passing a dictionary or a Series) are a bit more esoteric. However, they can streamline some otherwise lengthy code that performs conceptually simple operations:

mapper = Series({'a': 0, 'b': 1, 'c': 2}, name='right')
(
    s.to_frame('left')
    .merge(mapper, left_on='left', right_index=True)
    .set_index('left')['right']
    .rename().rename_axis(None)
)
a    0
b    1
c    2
dtype: int64
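
For comparison, the entire chain above collapses into a single .map call; a minimal sketch, noting one edge case: unmatched values (like 'd' here) surface as NaN under .map instead of being silently dropped by the inner merge.

```python
from pandas import Series

s = Series([*'abcd'])
mapper = Series({'a': 0, 'b': 1, 'c': 2})

# one .map call replaces the whole to_frame/merge/set_index chain;
# the unmatched 'd' becomes NaN rather than disappearing entirely
result = s.map(mapper)
print(result)
```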

So the primary purpose of Series.map is to call a function on each value of the Series. But what about pandas.DataFrame.map?

DataFrame.map

DataFrame.map is a relatively new addition to the pandas API (added in version 2.1.0; before that, it was called DataFrame.applymap). The idea here is that we have a DataFrame and we want to call a function on each of the values within that DataFrame.

In the below example, we call str.upper on each scalar value independently and construct a new DataFrame from those results.

from pandas import DataFrame

df = DataFrame({'x': [*'abc'], 'y': [*'def']})
df.map(str.upper)
x y
0 A D
1 B E
2 C F

It also supports the na_action argument, which has the same behavior we saw earlier. Pretty simple, right?

By investigating .map at both the Series and DataFrame level we understand the pattern: .map is all about calling a function on the smallest unit of a given container. In pandas, the "smallest unit" is always going to be the immediate values that exist within a Series, which provides a container hierarchy for us to reason about.

  1. DataFrames are containers that house zero or more Series.

  2. Series are containers that house zero or more values.

  3. Values themselves can be of any type, and we do not discern nested containers at this level.

DataFrames → Series → values.

So, if you have a function that needs to work on values within pandas, then .map is going to be your friend. Of course, if you want your code to be as performant as possible, then you should try to refactor your code to avoid user-defined functions and .map entirely.
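
As a sketch of that kind of refactor, a per-value UDF can usually be replaced by the equivalent vectorized expression:

```python
from pandas import Series

s = Series([1, 2, 3, 4])

udf_version = s.map(lambda x: x * 2 + 1)  # one Python call per value
vectorized  = s * 2 + 1                   # one pass in the pandas domain
print(udf_version.equals(vectorized))     # same result, far less overhead
```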

What is .pipe?

Both Series and DataFrame have a .pipe method, and thankfully .pipe does not have the same degree of input flexibility that .map has. Both Series.pipe and DataFrame.pipe simply call the passed function on the object that .pipe was invoked from.

This directly means that the following are exactly equivalent:

from pandas import Series

def add_one(x):
    # x is a Series in this example
    return x + 1

s = Series([0, 1, 2])
DataFrame({
    'direct call': add_one(s),
    'pipe': s.pipe(add_one)
})
direct call pipe
0 1 1
1 2 2
2 3 3

Similarly, its DataFrame counterpart can also be used to keep the data flow reading from left to right. For example, if we have a user-defined function that takes in a DataFrame and outputs a new DataFrame, we can either call this function directly on its input or we can use DataFrame.pipe.

from pandas import DataFrame, concat

def clean(df):
    """Don't worry, there is a more coherent way to write this using `.apply`
    """
    string_cols = df.select_dtypes('object').columns
    remaining_cols = df.columns.difference(string_cols)

    clean_string_cols = (df[col].str.strip() for col in string_cols)
    remaining_df = df[remaining_cols]

    return (
        concat([*clean_string_cols, remaining_df], axis=1) # recreate DataFrame from processed parts
        .reindex(columns=df.columns) # use original column ordering
        .rename(columns=str.upper)   # upper case column names
    )

df = DataFrame({
    'a': [1, 2, 3],
    'b': ['  hello world', 'extra spaces  ', '...']
})

display(
    df.pipe(clean).map(repr),
    df.pipe(clean).equals(clean(df)) # produces the same output!
)
A B
0 1 'hello world'
1 2 'extra spaces'
2 3 '...'
True

And that’s the entire story of .pipe. It is probably the most straightforward method that we will cover today! Typically, I use .pipe to call functions on the current state of a given DataFrame, allowing me to maintain an ongoing method chain without needing to break it. Additionally, I'll use .pipe as a crutch for other pandas methods that do not support method-chained input, which you can read more about in a previous blog post.
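
As a sketch of that "crutch" usage (the column names here are made up for illustration), .pipe with a lambda lets a top-level function like concat participate in a chain:

```python
from pandas import DataFrame, concat

df = DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

result = (
    df
    .assign(double=lambda d: d['value'] * 2)
    # concat is a top-level function, not a method; .pipe keeps the
    # chain reading left to right instead of wrapping it inside-out
    .pipe(lambda d: concat([d, d]))
    .reset_index(drop=True)
)
print(len(result))
```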

What is .apply?

Just like .map, both DataFrames and Series have a .apply method.

Just as .map allows us to operate on the smallest unit of a given pandas object, .apply allows us to operate on the unit one level beneath the current one. Referring back to our container hierarchy:

DataFrames → Series → values.

The above implies that DataFrame.apply allows us to operate on individual Series objects. And Series.apply allows us to operate on the values within a Series.
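
We can verify which level of the hierarchy each method feeds to our function with a tiny sketch:

```python
from pandas import DataFrame, Series

df = DataFrame({'x': [1, 2], 'y': [3, 4]})

# DataFrame.apply hands each column to the function as a whole Series
per_column = df.apply(lambda obj: isinstance(obj, Series))
# Series.apply hands each scalar value to the function
per_value = df['x'].apply(lambda obj: isinstance(obj, Series))
print(per_column, per_value, sep='\n')
```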

Series.apply

But wait a minute: we already have a method to call a function on the values in a Series. It was .map, so why is there another way to accomplish the same thing? Honestly, I don't think this method should exist, as it adds confusion and introduces some potential footguns. Let's first revisit what we’ve seen in order to choose the most appropriate approach for calling a function. In the code below, we have a function that evaluates x + 1, and since Python is a duck-typed language, we can call this function on a single value (e.g., 100) or on an entire Series.

If we invoke this function via .map, then we make one Python function call per value in the Series. However, if we invoke it via .pipe, we call the function only once and pandas takes care of the hard part of the computation for us. We can see the massive difference in performance that the repeated function calls (and Python-level iteration) have on our code.

from pandas import Series

def add_one(x):
    return x + 1

s = Series([*range(100_000)])

%timeit -n1 -r1 s.map(add_one)
%timeit -n1 -r1 s.pipe(add_one)
30.3 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
336 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

So if Series.apply is most similar to Series.map, then we should see similar performance.

%timeit -n1 -r1 s.apply(add_one)
%timeit -n1 -r1 s.apply(add_one, by_row=False)
29.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
405 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

But wait, what’s that magic argument by_row=False? Does it speed up all computations? Nope! Here we can see the redundancy of Series.apply: all the by_row argument does is toggle whether we call the function once per value (i.e., .map) or pass the entire Series into the function (i.e., .pipe).

In fact, by default, Series.apply attempts to detect whether the passed function can be called on the entire Series at once.

from numpy import add as np_add

s = Series([*range(100_000)])

%timeit -n1 -r1 s.apply(np_add, args=(1,))
%timeit -n1 -r1 s.apply(np_add, args=(1,), by_row=False)
834 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
684 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

But I would not rely on pandas being able to guess the nature of the passed function. If you want to write more idiomatic pandas code, I would avoid Series.apply entirely: there are more specific methods to reach for in every scenario.

DataFrame.apply

Now this is a much more useful and interesting method, because this is how we can operate on the individual Series objects that exist within a DataFrame. .apply gets very bad publicity on the internet for being slow, but that reputation stems primarily from the Series.apply example above and from the DataFrame.apply(…, axis=1) pattern we will look at shortly.

from pandas import DataFrame

df = DataFrame({
    'a': ['some strings', '  MORE STRINGS   '],
    'b': ['  hello world', 'extra spaces  '],
})

df.map(repr) # map(repr) to show the white space at the beginning/end of strings
a b
0 'some strings' ' hello world'
1 ' MORE STRINGS ' 'extra spaces '

If I want to convert all of these values to lowercase and strip the whitespace, you may think I should reach for .map since I need to operate on every value in the DataFrame. However, we should always take advantage of the vectorized operations that pandas has access to. The recently added PyArrow backend has some incredibly fast string implementations, which we can only access via the pandas.Series interface.

The following examples produce the same output:

df.map(lambda v: v.strip().lower()) # call function on each value
a b
0 some strings hello world
1 more strings extra spaces
df.apply(lambda s: s.str.strip().str.lower()) # call function on each series
a b
0 some strings hello world
1 more strings extra spaces

But if we increase the dataset size and time the computation, you'll start to see some massive differences.

from numpy.random import default_rng
from string import ascii_uppercase

rng = default_rng(0)
strings = (
    rng.choice([*ascii_uppercase], size=(3_000_000, length := 10), replace=True)
    .view(f'<U{length}')
    .ravel()
)

large_df = DataFrame({
    's1': Series(strings, dtype='string[pyarrow]'),
    's2': Series(strings, dtype='string[pyarrow]').sample(frac=1),
})
large_df.tail()
s1 s2
2999995 UUPNPBZKEP UUPNPBZKEP
2999996 YIVFZJWOJC YIVFZJWOJC
2999997 UTURFYOZKV UTURFYOZKV
2999998 KYHWATEASI KYHWATEASI
2999999 VJXHXYLOFS VJXHXYLOFS
%timeit -n1 -r1 large_df.map(lambda v: v.strip().lower())           # call function on each value
%timeit -n1 -r1 large_df.apply(lambda s: s.str.strip().str.lower()) # call function on each series
1.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
222 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
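
This column-wise approach also makes good on the docstring hint from the .pipe section: one possible (sketch) rewrite of the clean function using DataFrame.apply:

```python
from pandas import DataFrame

def clean(df):
    # strip whitespace from string (object-dtype) columns, pass the rest
    # through untouched, then upper-case the column names
    return (
        df.apply(lambda s: s.str.strip() if s.dtype == object else s)
          .rename(columns=str.upper)
    )

df = DataFrame({
    'a': [1, 2, 3],
    'b': ['  hello world', 'extra spaces  ', '...'],
})
cleaned = clean(df)
print(cleaned)
```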

That said, DataFrame.apply also carries one of the largest footguns that new users might encounter: thinking you need to iterate along the rows of the DataFrame to apply some custom logic. This will almost ALWAYS be the computationally slowest thing you can do.

from pandas import DataFrame

df = DataFrame({
    'a': ['some strings', '  MORE STRINGS   '],
    'b': ['  hello world', 'extra spaces  '],
})

df.apply(lambda row: row['a'].lower().strip(), axis=1)
0    some strings
1    more strings
dtype: object

That said, what is slow for millions of rows is relatively fast for hundreds of rows. So if you can't refactor your problem to avoid .apply(…, axis=1) and your code runs at a satisfactory speed, then you can move on to a more important problem.
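
When the row-wise logic only touches individual columns, the refactor is usually direct; a sketch of moving the example above back to the column level:

```python
from pandas import DataFrame

df = DataFrame({
    'a': ['some strings', '  MORE STRINGS   '],
    'b': ['  hello world', 'extra spaces  '],
})

row_wise    = df.apply(lambda row: row['a'].lower().strip(), axis=1)  # one Python call per row
column_wise = df['a'].str.lower().str.strip()                         # one vectorized pass
print(row_wise.equals(column_wise))  # same result, no row iteration
```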

Wrap-Up

Whew! That has been a lot of pandas code.

.map, .pipe, and .apply allow us to call user-defined functions on our pandas objects while targeting different levels of the container hierarchy. Don't forget that the core pandas objects are containers:

DataFrame → Series → Values

And conceptually:

  • .map functions are ALWAYS fed the Values as input

  • .pipe functions are fed input of the same level of the hierarchy

    • specifically, these functions operate on the object that owns .pipe.

  • .apply functions are fed input that is one level lower in the hierarchy

    • DataFrame.apply receives Series as its inputs

    • Series.apply receives Values as its inputs (just use .map)

What are your thoughts? Let me know on the DUTC Discord server!
