Polars Expressions on Nested Data#

Welcome back to Cameron’s Corner! This week, I wanted to share another interesting question I came across on Stack Overflow: “How to add numeric value from one column to other List column elements in Polars?”.

Speaking of Polars, make sure you sign up for our upcoming µTraining, “Blazing Fast Analyses with Polars.” This µTraining comprises live discussion, hands-on problem solving, and code review with our instructors. You won’t want to miss it!

Back to the topic at hand.

This question centers on the current lack of broadcast operations for the Polars List and Array datatypes (an open feature request as of 2024-02-07).

To best explore this topic, we’ll need our trusty exception shortener:

from traceback import print_exc
from contextlib import contextmanager

@contextmanager
def short_traceback(limit=1):
    """Print only the last `limit` frames of any raised exception."""
    try:
        yield
    except Exception:
        print_exc(limit=limit)

What are Nested Types in Polars?#

One of my favorite features in Polars is the first-class support for nested data structures.

from polars import DataFrame, col, element

df = DataFrame({
    'lst': [[0, 1], [9, 8]],
    'val': [ 100,    200  ],
})

display(
    df.schema,
    df,
)
Schema([('lst', List(Int64)), ('val', Int64)])
shape: (2, 2)
┌───────────┬─────┐
│ lst       ┆ val │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [0, 1]    ┆ 100 │
│ [9, 8]    ┆ 200 │
└───────────┴─────┘

Note that the datatype of our column 'lst' is List(Int64). Polars’ List datatype must hold homogeneous values, Int64 in this case.

We cannot mix value-types within the polars.List or polars.Array types.

DataFrame({
    'lst': [[0, 1      ], [9,       8]],
    'obj': [[0, 'hello'], ['world', 8]], # note: the mixed values are cast to a common dtype (str)
}, strict=False)
shape: (2, 2)
┌───────────┬────────────────┐
│ lst       ┆ obj            │
│ ---       ┆ ---            │
│ list[i64] ┆ list[str]      │
╞═══════════╪════════════════╡
│ [0, 1]    ┆ ["0", "hello"] │
│ [9, 8]    ┆ ["world", "8"] │
└───────────┴────────────────┘

For comparison, pandas has no first-class support for nested data. Any pandas.Series that contains lists is backed by the object datatype (arbitrary Python objects), and operations on object-dtype columns perform no better than carrying out the computation in a Python-level for-loop.

df.to_pandas().dtypes # `lst` changes from `polars.List` → `pandas.object`
lst    object
val     int64
dtype: object
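To make this concrete, here is a quick illustrative sketch (the 'shifted' column name is purely hypothetical) of elementwise work on that object-dtype column; pandas leaves us no better option than a Python-level loop:

pdf = df.to_pandas()

# object dtype ⇒ the computation happens element-by-element in Python
pdf['shifted'] = [[x + 1 for x in lst] for lst in pdf['lst']]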

Polars List Operations#

Moving back to Polars, the List datatype gives us some unique expressions that we can use, accessed via the Expr.list namespace.

For example, we can use col(…).list.get to get the nth item from each sublist.

df.with_columns(
    first=col('lst').list.get(0),
    second=col('lst').list.get(1),
)
shape: (2, 4)
┌───────────┬─────┬───────┬────────┐
│ lst       ┆ val ┆ first ┆ second │
│ ---       ┆ --- ┆ ---   ┆ ---    │
│ list[i64] ┆ i64 ┆ i64   ┆ i64    │
╞═══════════╪═════╪═══════╪════════╡
│ [0, 1]    ┆ 100 ┆ 0     ┆ 1      │
│ [9, 8]    ┆ 200 ┆ 9     ┆ 8      │
└───────────┴─────┴───────┴────────┘

We can also use .list.eval to perform operations within each list. When using .list.eval, we need to pass an expression that contains polars.element to represent each individual element within each list.

from polars import element

df.with_columns(col('lst').list.eval(element() + 9000))
shape: (2, 2)
┌──────────────┬─────┐
│ lst          ┆ val │
│ ---          ┆ --- │
│ list[i64]    ┆ i64 │
╞══════════════╪═════╡
│ [9000, 9001] ┆ 100 │
│ [9009, 9008] ┆ 200 │
└──────────────┴─────┘
The .list namespace has many more methods than we can cover here; we can introspect it programmatically to peek at a few:

from itertools import islice

list_accessor = col('').list
list_accessor_attributes = (
    attr for attr in dir(list_accessor)       # every attribute on the namespace
    if callable(getattr(list_accessor, attr)) # keep the methods…
    and not attr.startswith('__')             # …but skip dunders
)

print(
    *islice(list_accessor_attributes, 0, 5),
    '…',
)
all any arg_max arg_min concat …

Of course, you can also visit the Python Polars List API Reference for an exhaustive list.

The Problem at Hand#

Using the logic above, one may expect to be able to pass a column into an expression that uses polars.element for elementwise computation within a list.

The user in this question wanted to add the value in 'val' to each of the nested values in 'lst'. Here are the expected input and output:

display(
    'Input',
    df,
    'Output',
    DataFrame({
        'lst': [[100, 101], [209, 208]],
        'val': [ 100,        200      ],
    })
)
'Input'
shape: (2, 2)
┌───────────┬─────┐
│ lst       ┆ val │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [0, 1]    ┆ 100 │
│ [9, 8]    ┆ 200 │
└───────────┴─────┘
'Output'
shape: (2, 2)
┌────────────┬─────┐
│ lst        ┆ val │
│ ---        ┆ --- │
│ list[i64]  ┆ i64 │
╞════════════╪═════╡
│ [100, 101] ┆ 100 │
│ [209, 208] ┆ 200 │
└────────────┴─────┘

So far, the intuitive approach in Polars has solved many of my problems; however, it falls short here:

with short_traceback():
    df.with_columns(
        col('lst').list.eval(element() + col('val'))
    )
Traceback (most recent call last):
  File "/tmp/ipykernel_909870/832244687.py", line 8, in short_traceback
    yield
polars.exceptions.ComputeError: named columns are not allowed in `list.eval`; consider using `element` or `col("")`

Seems like we’ll need to search for another solution here. Thankfully, we have many to choose from!

Solution(s)#

Here, I have taken the suggested answers from the Stack Overflow question and provided them as possible solutions.

Alignment via explode#

From a mechanical perspective (and this is the approach we would likely take in pandas), we can .explode our DataFrame. This lets us easily align each value in 'lst' against the corresponding value from 'val' to perform our operation.

In this approach, we assign a row index to track the original groupings of values. Then, we explode our data, perform our computation, and use a .group_by to recreate our original List data:

(
    df.with_row_index()
    .explode('lst')
    .group_by('index', maintain_order=True).agg( # preserve the original row order
        lst=col('lst') + col('val'),
        val=col('val').first(),
    )
    .drop('index')
)
shape: (2, 2)
┌────────────┬─────┐
│ lst        ┆ val │
│ ---        ┆ --- │
│ list[i64]  ┆ i64 │
╞════════════╪═════╡
│ [100, 101] ┆ 100 │
│ [209, 208] ┆ 200 │
└────────────┴─────┘

In a similar vein, we can perform the .explode directly on the 'lst' column and use .over to handle the alignment. .implode then transforms our results back into lists that align with our original DataFrame.

(
    df.with_row_index()
    .with_columns(
        (col('lst').explode() + col('val')).implode().over('index')
    )
    .drop('index')
)
shape: (2, 2)
┌────────────┬─────┐
│ lst        ┆ val │
│ ---        ┆ --- │
│ list[i64]  ┆ i64 │
╞════════════╪═════╡
│ [100, 101] ┆ 100 │
│ [209, 208] ┆ 200 │
└────────────┴─────┘

Alignment via programmatic List operations#

Instead of exploding our data to align our nested column against our non-nested column, we can use the previously mentioned .list.get(…) method to pull out each of our values. However, for this approach to work, we NEED to know the maximum number of values a given List can hold. If we can make that assumption, then the solution arises quite naturally: we evaluate the two expressions:

col('lst').list.get(0) + col('val') # first element in each list + val
col('lst').list.get(1) + col('val') # second element in each list + val

The above computations will be carried out in parallel. However, the small downside is that we then need to concat_list these results back into their original List data structures, which can be an expensive operation.

from polars import concat_list

(
    df.with_columns(
        concat_list( # rebuild each list from its pre-determined length (range(2))
            col('lst').list.get(i) + col('val') for i in range(2)
        )
    )
)
shape: (2, 2)
┌────────────┬─────┐
│ lst        ┆ val │
│ ---        ┆ --- │
│ list[i64]  ┆ i64 │
╞════════════╪═════╡
│ [100, 101] ┆ 100 │
│ [209, 208] ┆ 200 │
└────────────┴─────┘

Align dtypes (list)#

With the above reshaping/alignment approaches covered, I wanted to explore whether Polars supports this operation directly, either on the columns as-is or once they share the exact same datatype.

with short_traceback():
    (
        df.with_columns(col('lst') + col('val'))
    )
Traceback (most recent call last):
  File "/tmp/ipykernel_909870/832244687.py", line 8, in short_traceback
    yield
polars.exceptions.SchemaError: failed to determine supertype of list[i64] and i64

What if I coerce 'val' to have the same datatype as 'lst'?

with short_traceback():
    (
        df.cast({'val': df.schema['lst']})
        .with_columns(col('lst') + col('val'))
    )
Traceback (most recent call last):
  File "/tmp/ipykernel_909870/832244687.py", line 8, in short_traceback
    yield
polars.exceptions.InvalidOperationError: `add` operation not supported for dtype `list[i64]`

It seems that neither of the above approaches works. To make sure, I also tested both approaches with the polars.Array datatype and got the same results. Broadcasting over the Array datatype does seem like an important feature, especially for all of the Python-Polars users migrating from tools like NumPy.
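A sketch of that Array test, assuming a fixed width of 2 to match our data:

from polars import Array, Int64

arr_df = df.with_columns(col('lst').cast(Array(Int64, 2))) # List → fixed-width Array

with short_traceback():
    arr_df.with_columns(col('lst') + col('val')) # fails just like the List version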

Align dtypes (struct)#

Interestingly enough, there IS a nested datatype that broadcasts columnar operations: polars.Struct! All we need to do is convert both our 'lst' and 'val' columns to structs, and we can carry out the operation as expected.

from polars import struct

(
    df.with_columns(
        (col('lst').list.to_struct() + struct(col('val')))
    )
)
shape: (2, 2)
┌───────────┬─────┐
│ lst       ┆ val │
│ ---       ┆ --- │
│ struct[2] ┆ i64 │
╞═══════════╪═════╡
│ {100,101} ┆ 100 │
│ {209,208} ┆ 200 │
└───────────┴─────┘

The only caveat is that the returned datatype is also a struct; if we want our List datatype back, we’ll need to apply another conversion, sketched below.
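Here is one way back to a List, sketched under the assumption that .list.to_struct() keeps its default field names ('field_0', 'field_1'):

from polars import concat_list, struct

(
    df.with_columns(
        col('lst').list.to_struct() + struct(col('val'))
    )
    .with_columns(
        concat_list( # stitch the struct fields back into a single List column
            col('lst').struct.field('field_0'),
            col('lst').struct.field('field_1'),
        ).alias('lst')
    )
)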

DuckDB (struct broadcasting)#

The final approach performs the computation in DuckDB instead of Polars, leveraging the shared (Apache Arrow) memory format for a near zero-copy conversion. We can use our polars.DataFrame to execute a query in DuckDB and get a polars.DataFrame back without a TON of overhead (but, yes, there will be some overhead to pay).

import duckdb
from polars import DataFrame, col, struct, concat_list

df = DataFrame({
    'lst': [[0, 1], [9, 8]],
    'val': [ 100,    200  ],
})


duckdb.sql("""
    select
        (select list(l.e + a.val) from unnest(lst) as l(e)) as lst,
        a.val,
    from df as a
""").pl()
shape: (2, 2)
┌────────────┬─────┐
│ lst        ┆ val │
│ ---        ┆ --- │
│ list[i64]  ┆ i64 │
╞════════════╪═════╡
│ [100, 101] ┆ 100 │
│ [209, 208] ┆ 200 │
└────────────┴─────┘

Wrap-Up#

And there you have it: five different ways to solve a seemingly innocuous problem. I am very excited about the continued development of Polars as it rounds out its API. I feel that nested elementwise operations should be allowed to access other columns, and I hope this feature lands at some point. Beyond that, coupling Polars and DuckDB is incredibly powerful, and I will continue to explore this pairing in the future.

What do you think about my approach? Let us know on the DUTC Discord server.

Talk to you all next time!