Table of Contents

Coming Soon: pandas.col

An old dog can learn new tricks, and just because pandas is the most veteran Python DataFrame library, it doesn't mean that it can't refine and add APIs that users love.

It's no secret that Polars has taken the Python DataFrame world by storm, boasting its incredible speed and built-in query optimizer. However, that doesn't explain why it became such an approachable tool. Beyond its speed, Polars' expression syntax gives users a concise way to reference columns in a DataFrame and chain operations on them.

If users love the expression system that Polars introduced, why not bring some of that goodness to pandas and add an intuitive way to interact with your pandas code? At least that's what Marco Gorelli, author of Narwhals and a maintainer of both pandas and Polars, asked in a recently merged pull request to pandas. (If you want to see the amazing code you can write using this new feature, check out Marco's blog post "Expressions are coming to pandas!")

So will pandas have full-on expressions?

That's a great question, thank you for asking. The answer is a bit mixed. As of right now, pandas is going to support one expression-like function: col, which lets users easily select and manipulate individual columns wherever pandas accepts a function as its input.

Today, I want to implement the col function to show you how it operates under the hood.

Why col?

We've been given a blank canvas and we want to implement col in pandas, but where do we even start? How should it behave, and how should it interact with pandas’ ecosystem? When working on new features, it helps to clarify two aspects:

  1. How will it be used (in what contexts will it be called)?

  2. What are its inputs & outputs (typing)?

Consider these examples:

df.loc[col('a') <= 2] # returns df, subsetted where values in column a are <= 2
df.assign(
    b=col('a') ** 2  # creates new column `b` which is equivalent to column `a` squared
)

Of course, the above isn't executable, but at least we have a starting point with our usage of col: we want it to take a string (the name of a column) and return a function that can then be ingested by pandas methods.

As you may know, many pandas methods already accept callables: functions that receive the DataFrame and return a Series, DataFrame, or scalar. This is the foothold on which we'll build col's behavior.

from pandas import DataFrame

df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

(
    df
    .loc[lambda d: d['a'] <= 2]      # filter rows based on the result of function
    .assign(c=lambda d: d['b'] ** 2) # add column based on result of function
)
   a  b   c
0  1  4  16
1  2  5  25

Making col Callable

From here, we can easily link this to an implementation of col by having it return something callable.

from typing import Callable
from pandas import DataFrame, Series, Timestamp

# Just for brevity, probably not how we would type this in a production system
Scalar = int | float | str | Timestamp

def col(name: str) -> Callable[[DataFrame], DataFrame | Series | Scalar]:
    def function(df):
        return df[name]
    return function

df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(c=col('a'))
   a  b  c
0  1  4  1
1  2  5  2
2  3  6  3

Supporting Expressions

While this works, a plain function does not give us the flexibility we are after. What if we want to do something like this?

df.assign(c=col('a') ** 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 df.assign(c=col('a') ** 2)

TypeError: unsupported operand type(s) for ** or pow(): 'function' and 'int'

Clearly, returning just a function is not enough. We need to return an object that can interact with Python’s rich syntax in the same manner as a pandas.Series object, and that stores the procedures to call against the designated column so they can be run later.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Expr:
    # the stack holds delayed steps (callables), not other Expr objects
    _stack: list[Callable] = field(default_factory=list)

    def __call__(self, df):
        """traverse stored procedures and invoke them, passing the result from the
        previous method as the input of the current method
        """
        for func in self._stack:
            df = func(df)
        return df

    def __pow__(self, other):
        # instead of calling immediately, we add to our stack to be called later
        self._stack.append(lambda d: d ** other)
        return self


def col(name: str) -> Expr:
    return Expr([lambda df: df[name]])


df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(c=col('a') ** 2)
   a  b  c
0  1  4  1
1  2  5  4
2  3  6  9
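One caveat of this prototype is worth spelling out: because __pow__ appends to _stack and returns self, an expression that is bound to a name and reused keeps accumulating steps. The sketch below shows one possible way around that, assuming we are free to return a fresh object from every operation. The names ImmutableExpr and immutable_col are hypothetical and exist only for this illustration; this is not how pandas implements it.

```python
from dataclasses import dataclass
from typing import Callable

from pandas import DataFrame


@dataclass(frozen=True)
class ImmutableExpr:
    # a tuple cannot be appended to, so every operation must build a new object
    _stack: tuple[Callable, ...] = ()

    def __call__(self, df):
        result = df
        for func in self._stack:
            result = func(result)
        return result

    def __pow__(self, other):
        # return a *new* expression with one extra step; `self` is left untouched
        return ImmutableExpr(self._stack + (lambda d: d ** other,))


def immutable_col(name: str) -> ImmutableExpr:
    return ImmutableExpr((lambda df: df[name],))


df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a = immutable_col('a')
squared = a ** 2
cubed = a ** 3  # does not inherit the squaring step from `squared`

result = df.assign(squared=squared, cubed=cubed)
```

With the mutating version, `a ** 2` followed by `a ** 3` would leave both names pointing at the same object with both steps stacked; here each name keeps its own independent recipe.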

Generalizing Operations

Implementing __pow__ to raise values to arbitrary powers is a good start. But what about all the other methods, such as .mean, .sum, or .div?

Thankfully, most of those can be delegated dynamically, which is a fancy way of saying "remember which method was called on the Expr, then, when it is time to evaluate against data, invoke the method of the same name on the data that was passed in."

The idea goes like this:

from pandas import Series

s = Series([1, 2, 3])

method_name, *args = ('add', 10) # this is our "delayed" method call
getattr(s, method_name)(*args)   # get the Series method `add` and call it with input: `10`
0    11
1    12
2    13
dtype: int64

This pattern can be shortened thanks to the built-in operator.methodcaller.

from operator import methodcaller

func = methodcaller('add', 10)
func(s)
0    11
1    12
2    13
dtype: int64
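methodcaller also stores keyword arguments, and several callers can be applied one after another, which is the same shape as the stack inside Expr. A small sketch, using only the standard library and pandas (the variable names are just for illustration):

```python
from operator import methodcaller

from pandas import Series

s = Series([1.0, None, 3.0])

# positional and keyword arguments are stored now and passed along on invocation
fill_then_sum = [methodcaller('fillna', 0), methodcaller('sum')]
skip_missing = methodcaller('mean', skipna=True)

value = s
for step in fill_then_sum:
    value = step(value)  # each stored call receives the previous result

total = value              # 1.0 + 0.0 + 3.0
average = skip_missing(s)  # mean of the non-missing values: (1.0 + 3.0) / 2
```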

With the mechanics worked out, we can build this capacity into our Expr object above!

from dataclasses import dataclass, field
from operator import methodcaller
from typing import Callable

@dataclass
class Expr:
    _stack: list[Callable] = field(default_factory=list)

    def __call__(self, df):
        for func in self._stack:
            df = func(df)
        return df

    def __pow__(self, other):
        # operators are looked up on the class, so `**` never reaches __getattr__
        self._stack.append(methodcaller('__pow__', other))
        return self

    def __getattr__(self, attr):
        """Return a function that allows args/kwargs to be passed in and
        stores the complete procedure to get the correct method and invoke it with passed args/kwargs
        """
        def store_method(*args, **kwargs):
            self._stack.append(methodcaller(attr, *args, **kwargs))
            return self
        return store_method


df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(
    df.assign(
        c=col('a') ** 2,
        d=col('b').add(10),
        f=col('b').add(10).mul(3),
        g=col('b').add(10).mul(3).mean(),
    )
)
   a  b  c   d   f     g
0  1  4  1  14  42  45.0
1  2  5  4  15  45  45.0
2  3  6  9  16  48  45.0
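The prototype still cannot run the df.loc[col('a') <= 2] example from the start of this post, because Expr defines no comparison operators and Python does not route operators through __getattr__. A minimal sketch of how they could be added, assuming the same delayed-stack design, might look like the following. CompareExpr and compare_col are hypothetical names used only here, and this is not a description of the pandas implementation.

```python
import operator
from typing import Callable

from pandas import DataFrame


class CompareExpr:
    """Hypothetical stand-in for the Expr prototype, limited to comparisons."""

    def __init__(self, stack: list[Callable]):
        self._stack = stack

    def __call__(self, df):
        result = df
        for func in self._stack:
            result = func(result)
        return result

    def _binary(self, op: Callable, other):
        # store `op(previous_result, other)` as one more delayed step
        return CompareExpr([*self._stack, lambda d: op(d, other)])

    def __le__(self, other):
        return self._binary(operator.le, other)

    def __lt__(self, other):
        return self._binary(operator.lt, other)

    def __ge__(self, other):
        return self._binary(operator.ge, other)

    def __gt__(self, other):
        return self._binary(operator.gt, other)


def compare_col(name: str) -> CompareExpr:
    return CompareExpr([lambda df: df[name]])


df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
subset = df.loc[compare_col('a') <= 2]  # .loc calls the expression with df
```

The comparison produces another callable rather than a boolean, so .loc evaluates it against the DataFrame and receives the boolean mask it needs.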

Wrap-Up

And there we have it: a working prototype of the col function that allows expressive column references and chained operations. It is nowhere near as fully featured as the Expression system staged for pandas 3.0 (our prototype does not support Expr.__repr__ nicely, nor does it accommodate namespace accessors such as .str.upper()), but we have at least built a conceptual understanding of how the system works and the brevity it will bring to our code. I can't wait for this feature to be released, and a huge thank you to Marco Gorelli for giving pandas one of its newest user-facing API changes!

What are your thoughts? Let me know on the DUTC Discord server!
