Structuring Narwhals Expressions

Last week, I discussed a possible approach to unwrapping Narwhals Expressions, focusing on the core programming concepts behind them.

To finish our discussion this week, I wanted to share some of the additional features we'll need to add to make convenient use of this parsing mechanism and share one of the applications of this work.

Parsing Expressions

Let's pick up where we left off: last week, we came up with a working method to parse a simple Narwhals Expression. Given the expression below, let's unwrap the underlying functions that will be called at each step of the computation.

import narwhals as nw # 1.41.0

expr = nw.col('a').fill_null(0) + 1

Given this expression, we can visually tell that our code should first select a given column 'a', then fill any null values in that column with 0, and finally add 1 to the resultant values.

While it is nice that we can expect these computations to be run by reading through the code, we are lacking a simple reflection of this structure so that we can programmatically inspect it. To parse it, we'll need to reach for the code that we wrote last week.

from inspect import getclosurevars

expr = nw.col('a').fill_null(0) + 1

leaf_func = getclosurevars(expr._to_compliant_expr).nonlocals['to_compliant_expr']

# extract the parent narwhals.Expr that was closed over in the `.__add__` call
parent_expr1 = getclosurevars(leaf_func).nonlocals['self']
parent_func1 = getclosurevars(parent_expr1._to_compliant_expr).nonlocals['to_compliant_expr']

# extract the parent narwhals.Expr that was closed over in the `fill_null` call
parent_expr2 = getclosurevars(parent_func1).nonlocals['self']
parent_func2 = getclosurevars(parent_expr2._to_compliant_expr).nonlocals['to_compliant_expr']

print(
    f'{parent_func2 = }',
    f'{parent_func1 = }',
    f'{leaf_func    = }',
    sep='\n'
)

parent_func2 = <function col.<locals>.func at 0x781d4c5d99e0>
parent_func1 = <function Expr.fill_null.<locals>.<lambda> at 0x781d4c5d9a80>
leaf_func    = <function Expr.__add__.<locals>.<lambda> at 0x781d4c5d9bc0>

The above code serves as a minimal prototype to help us assert that this is possible to do. However, we are parsing an expression here of a known depth. This means that I can use my knowledge of what the expression is to define how many parent_expr* and parent_func* variables to create to parse these entities out into each step.

To make this code more reusable, we'll need to write some parsing code that alleviates this constraint. In fact, we have two very common approaches for traversing a structure like this: recursion or iteration. Given that Python has a recursion limit and no optimizations built around it, I opted for an approach that uses the latter.

Generalized Expression Parsing

With this approach, I set up a stack (modeled via a Python list) of expressions to process them one by one storing the inspected Expression object and its enclosed function into a list of tuples.

from collections import deque
from inspect import getclosurevars
from typing import Callable
import narwhals as nw

# Expression reprs are very verbose and display metadata that is unimportant for our exploration
#  reverting the __repr__ back to the class name and memory address aids debugability
nw.Expr.__repr__ = lambda self: f'<{type(self).__qualname__} at {hex(id(self))}>'


def parse(expr: nw.Expr) -> list[tuple[nw.Expr, Callable]]:
    stack, nodes = [expr], []
    while stack:
        expr = stack.pop()
        func = getclosurevars(expr._to_compliant_expr).nonlocals['to_compliant_expr']
        nodes.append((expr, func))

        try:
            parent_expr = getclosurevars(func).nonlocals['self']
        except KeyError: # no parent expression, we've reached the root expression
            # this is the "base case" in the recursive formulation.
            continue
        else:
            # if there is a parent expression, we add it to our list to process
            # this is the inductive step in the recursive formulation.
            stack.append(parent_expr)
            
    return nodes

expr = nw.col('a').fill_null(0) + 1
parse(expr)

[(<Expr at 0x781d4c5bbe90>,
  <function narwhals.expr.Expr.__add__.<locals>.<lambda>(plx)>),
 (<Expr at 0x781d4c5bbec0>,
  <function narwhals.expr.Expr.fill_null.<locals>.<lambda>(plx)>),
 (<Expr at 0x781d4c5bbcb0>,
  <function narwhals.functions.col.<locals>.func(plx: 'Any') -> 'Any'>)]

And just like that, we've processed our expression call chain! The result here is stored in reverse order so you would read this from bottom to top:

select a column via nw.functions.col
fill null values via Expr.fill_null
add a value via Expr.__add__

This matches our visual parsing of the same expression we performed above.

While this more generic parser was fairly straightforward to write, we should acknowledge some limitations of our current representation. Returning a list of tuples is useful for viewing and understanding the sequence of operations, but it’s not ideal for manipulation or analysis. These computations are better modeled as a directed acyclic graph (DAG), where each operation is a node and edges represent data flow between steps. Representing the expression chain as a DAG would allow us to traverse, optimize, or even reconfigure the computation more flexibly.

Modeling Nodes as Data

Before we try to add too many features all at once, let's first swap out the simplistic tuple modeling with something a bit more flexible. In this case, I'll write a basic dataclass with a constructor that handles the function parsing. Each Node will consist of the expression that created it and the function that was enclosed in the expression object.

We'll also need to adjust our parser a bit, but we'll end with some code that will be easier to extend.

from typing import Callable, Any
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    expr: nw.Expr = field(repr=False)
    func: Callable

    @classmethod
    def from_expr(cls, expr: nw.Expr) -> 'Node':
        func = getclosurevars(expr._to_compliant_expr).nonlocals['to_compliant_expr']
        return cls(expr=expr, func=func)


def parse(expr: nw.Expr) -> list[Node]:
    stack, nodes = [expr], []
    while stack:
        nodes.append(node := Node.from_expr(stack.pop()))

        try:
            parent_expr = getclosurevars(node.func).nonlocals['self']
        except KeyError: # no parent expression, we've reached the root expression
            continue
        else:
            stack.append(parent_expr)
            
    return nodes

expr = nw.col('a').fill_null(0) + 1
parse(expr)

[Node(func=<function Expr.__add__.<locals>.<lambda> at 0x781d4c5da700>),
 Node(func=<function Expr.fill_null.<locals>.<lambda> at 0x781d4c5da5c0>),
 Node(func=<function col.<locals>.func at 0x781d4c5d94e0>)]

Now we have a nice representation of each of these nodes with simple access to its data. This still isn't much different from our tuple modeling, but we can easily extend our Node class by adding more methods/extending existing ones.

With that said, our parser needs a bit more work. If you haven't noticed yet, the expressions that we are currently parsing are very simple. Take a look at this composite expression:

expr = (nw.col('a').fill_null(0)) + (nw.col('b') * 2)
parse(expr)

[Node(func=<function Expr.__add__.<locals>.<lambda> at 0x781d4c5daf20>),
 Node(func=<function Expr.fill_null.<locals>.<lambda> at 0x781d4c5daac0>),
 Node(func=<function col.<locals>.func at 0x781d4c5da8e0>)]

If we ignore the output and inspect the expression visually, we'll see that there are multiple roots here: we independently select column 'a' and fill it, AND we select column 'b' and multiply it by 2. Then we take the result of those operations and add them together. Yet our parser returns only has a single column selection for column 'a' (which I reasoned about deductively since we see the Expr.fill_null operation and not the Expr.__mul__ operation).

So far, our Expressions have always contained a single root and a single leaf meaning that there is a direct linear flow of data through the graph. But DAGs get much more complex than that, we can have multiple inputs and multiple outputs at each step. While we won't model all of this complexity in a blog post, let's see if we can handle the case of parsing an expression with multiple roots.

Composite Expressions

To satisfy the issues we identified above we should also parse out the arguments from each Expression. This will help us disambiguate what Node we're actually investigating. Additionally, since our graph is a composition of subgraphs (here we saw 2 subgraphs that were independent of one another until) we should be able to preserve the independence of these subraphs in our resultant data structure.

To do this, I decided to make the Node instance traversable by climbing up its own parents (meaning each Node will be aware of the other expressions that occur before it). Additionally, I had to update the parser to return a dictionary of Expressions mapping to lists of nodes. Where each entry in this dictionary represents its own subgraph.

from typing import Callable, Any, Union, Deque
from dataclasses import dataclass, field
from collections import deque

# we’re going to use Expressions as keys in a dict
nw.Expr.__hash__ = lambda self: hash(id(self)) 

@dataclass(frozen=True)
class Node:
    expr: nw.Expr = field(repr=False)
    func: Callable
    args: list[Any]
    parent: Union["Node", None] = field(default=None, repr=False)

    @classmethod
    def from_expr(cls, expr: nw.Expr) -> 'Node':
        func = getclosurevars(expr._to_compliant_expr).nonlocals['to_compliant_expr']
        args = getclosurevars(func).nonlocals        
        try:
            parent_expr = args.pop('self')
        except KeyError:
            parent = None
        else:
            # recursively create all parent Nodes from enclosed expressions
            parent = cls.from_expr(parent_expr)
        return cls(func=func, args=args, expr=expr, parent=parent)

    def parents(self):
        """Traverses the nodes, yields the current Node first.
        """
        while self is not None:
            yield self
            self = self.parent

def parse(expr: nw.Expr) -> dict[nw.Expr, Deque[Node]]:
    graphs: dict[nw.Expr, Deque[Node]] = {}
    stack = [expr]
    while stack:
        leaf = Node.from_expr(expr := stack.pop())
        g = deque([])
        for node in leaf.parents():
            g.appendleft(node)  # store traversals as root → leaf
            for arg in node.args.values():
                if isinstance(arg, nw.Expr):
                    stack.append(arg)
        graphs[expr] = g
    return graphs


expr = (nw.col('a').fill_null(0)) + (nw.col('b') * 2)
parse(expr)

{<Expr at 0x781d4c5e1370>: deque([Node(func=<function col.<locals>.func at 0x781d4c5d93a0>, args={'flat_names': ['a']}),
        Node(func=<function Expr.fill_null.<locals>.<lambda> at 0x781d4c5db7e0>, args={'limit': None, 'strategy': None, 'value': 0}),
        Node(func=<function Expr.__add__.<locals>.<lambda> at 0x781d4c5dbba0>, args={'other': <Expr at 0x781d4c5e1280>})]),
 <Expr at 0x781d4c5e1280>: deque([Node(func=<function col.<locals>.func at 0x781d4c5db920>, args={'flat_names': ['b']}),
        Node(func=<function Expr.__mul__.<locals>.<lambda> at 0x781d4c5dba60>, args={'other': 2})])}

Now we have much more visibility into the computation than we did before! Not only do we know how many subgraphs compose this graph, but we can also view how many function calls compose each subgraph and the arguments that they were constructed with. If any Expression depends on another, this should appear as a new subgraph entry in the dictionary (take a close look at the 'other' argument to the Expr.__add__ node. You'll see that it depends on the same expression object that is found as the second key in this dictionary.

So what can we do with this amount of reflection in our computation? Well, something straightforward to do will be to give this computation a nice __repr__. In the most common form, the __repr__ of a Python object should return a string that can be evaluated to recreate the original object. Many 3rd party packages break this convention (looking at you NumPy, pandas, Polars… these are outliers because the state of the object is dependent on large volumes of data).

That said, we're just representing the steps of a computation. We should be able to recreate a string that can be evaluated back into those individual steps.

Expression Rendering

For the final step of our journey, we can reuse all of the code we've already written. We're going to simply extend our Node class with a couple of new methods that...

Extract the name of the enclosed function
Create a string that strings together the expression calls.
- for this, we want the result to be very close to what the input looked like.

As a brief example of what we want our result to look like, take a look at this Polars expression

import polars as pl

(pl.col('a').fill_null(0)) + (pl.col('b') * 2)

[(col("a").fill_null([dyn int: 0])) + ([(col("b")) * (dyn int: 2)])]

You'll see that there is a direct relationship between the code I wrote and the representation of that expression object. This is something we are hoping to achieve with Narwhals Expressions as well.

While our modeling is a bit more crude (this is a blog post, not a Pull Request after all) we can actually get quite close:

import re

@dataclass(frozen=True)
class Node(Node): # same from_expr & parents methods; 
                  # I reused the same class name to show that the parse does not need to change.
    expr: nw.Expr = field(repr=False)
    func: Callable
    args: list[Any]
    parent: Union["Node", None] = field(default=None, repr=False)

    def __post_init__(self):
        if self.func.__qualname__.startswith('col'):
            func_name = 'col'
        else:
            # parse the function name out of its qualified name
            func_name = (
                re.search(r'Expr\.(\w+)\..+$', self.func.__qualname__)
                .group(1).strip('_') # removes underscores from data model methods
            )                        #   e.g. __mul__ → mul 
        object.__setattr__(self, 'func_name', func_name)

    def render_expr(self, include_parents=False):
        """Represent the Node as a string. Akin to a __repr__ of the Expression

        include_parents: should the parent nodes be rendered as well? Or just the current Node?
        """
        args = []
        
        # Traverse down arguments that are Expressions and render them
        for arg in self.args.values():
            if isinstance(arg, nw.Expr):
                root = type(self).from_expr(arg)
                node_reprs = deque([])
                for node in root.parents():
                    node_reprs.appendleft(node.render_expr())
                arg = ".".join(node_reprs)
            args.append(arg)

        if include_parents is False or self.parent is None:
            return f'{self.func_name}({", ".join(map(str, args))})'
        return (
            f'{self.parent.render_expr(include_parents=include_parents)}'
            f'.{self.func_name}({", ".join(map(str, args))})'
        )


expr = (nw.col('a').fill_null(0)) + (nw.col('b') * 2)
for root, nodelist in parse(expr).items(): # render each subgraph separately
    print(root, '→', [node.render_expr() for node in nodelist])

<Expr at 0x781d4c1bcc80> → ["col(['a'])", 'fill_null(None, None, 0)', "add(col(['b']).mul(2))"]
<Expr at 0x781d4c1bcb60> → ["col(['b'])", 'mul(2)']

Since we move most of the actual parsing logic into the dataclass, and out of the parse function (to be honest, we should probably rename the function at this point because its main purpose is to organize subgraphs). We can simply convert our expression to its Node counterpart and render it all in a single go!

expr = (nw.col('a').fill_null(0)) + (nw.col('b') * 2)

print( # render the Node in its entirety
    Node.from_expr(expr).render_expr(include_parents=True)
)

col(['a']).fill_null(None, None, 0).add(col(['b']).mul(2))

We can even handle more complex expressions like the one below! Accurately identifying each subgraph and representing its flow as a string.

expr = (
   ((~nw.col('a')) ** 2 * (nw.col('b') - 3) + 10).mean() // nw.col('c')
)
for root, nodelist in parse(expr).items():
    print(root, '→', [node.render_expr() for node in nodelist])

<Expr at 0x781d4c1be0c0> → ["col(['a'])", 'invert()', 'pow(2)', "mul(col(['b']).sub(3))", 'add(10)', 'mean()', "floordiv(col(['c']))"]
<Expr at 0x781d4c1bdc40> → ["col(['b'])", 'sub(3)']
<Expr at 0x781d4c1bdf70> → ["col(['c'])"]

Wrap-Up

With just a bit of reflection and some light introspection, we’ve managed to unpack quite a bit about how Narwhals Expressions are structured. Starting from a hardcoded chain of getclosurevars calls, we built up a generalized parser, refactored our representation into a more flexible data structure, and even modeled the expression graph as a set of independent subgraphs. From there, we used that structure to recreate a string representation of each computation, getting surprisingly close to the original code.

There’s still plenty more we could do with this setup. A DAG model of expressions can be a powerful tool for things like optimization, analysis, or even serialization. But for now, it’s just nice to see that a little structure and a little effort go a long way toward making these kinds of APIs more transparent and easier to work with.

What do you think about our discussion so far? Let me know on the DUTC Discord server!

That's all for this week! Until next time.