Polars Expressions on Nested Data#
Welcome back to Cameron’s Corner! This week, I wanted to share another interesting question I came across on Stack Overflow: “How to add numeric value from one column to other List column elements in Polars?”.
Speaking of Polars, make sure you sign up for our upcoming µtraining, “Blazing Fast Analyses with Polars.” This µtraining comprises live discussion, hands-on problem solving, and code review with our instructors. You won’t want to miss it!
Back to the topic at hand.
This question centers on the current lack of broadcasted operations between Polars List and Array datatypes and non-nested columns (support for which is an open feature request as of 2024-02-07).
To best explore this topic, we’ll need our trusty exception shortener:
from traceback import print_exc
from contextlib import contextmanager

# from polars.exceptions import PolarsPanicError

@contextmanager
def short_traceback(limit=1):
    try:
        yield
    except Exception:
        print_exc(limit=limit)
What are Nested Types in Polars?#
One of my favorite features in Polars is the first-class support for nested data structures.
from polars import DataFrame, col, element

df = DataFrame({
    'lst': [[0, 1], [9, 8]],
    'val': [ 100, 200 ],
})

display(
    df.schema,
    df,
)
Schema([('lst', List(Int64)), ('val', Int64)])
lst | val |
---|---|
list[i64] | i64 |
[0, 1] | 100 |
[9, 8] | 200 |
Note that the datatype of our column 'lst' is List(Int64). Polars’ List datatype must contain homogeneous values (Int64, in this case); we cannot mix value types within the polars.List or polars.Array types.
DataFrame({
    'lst': [[0, 1 ], [9, 8]],
    'obj': [[0, 'hello'], ['world', 8]],  # note: mixed values are cast to a common dtype (str → list[str])
}, strict=False)
lst | obj |
---|---|
list[i64] | list[str] |
[0, 1] | ["0", "hello"] |
[9, 8] | ["world", "8"] |
For comparison, pandas has no first-class support for nested data. Any pandas.Series that contains lists will be backed by the object datatype (arbitrary Python objects), and operations on object datatypes in pandas perform no better than carrying out the computation at the Python level via a for-loop.
df.to_pandas().dtypes # `lst` changes from `polars.List` → `pandas.object`
lst object
val int64
dtype: object
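To make that concrete, here is a minimal sketch of an elementwise addition once we are in pandas: with an object column, we are left with Python-level iteration (.map with a comprehension is one of several equivalent spellings).

pdf = df.to_pandas()

# every cell of `lst` is an arbitrary Python object, so pandas must
# loop in Python rather than dispatch to a vectorized kernel
pdf['plus_9000'] = pdf['lst'].map(lambda xs: [x + 9000 for x in xs])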
Polars List Operations#
Moving back to Polars, the List datatype gives us some unique expressions that we can use, accessed via the Expr.list namespace. For example, we can use col(…).list.get to get the nth item from each sublist.
df.with_columns(
    first=col('lst').list.get(0),
    second=col('lst').list.get(1),
)
lst | val | first | second |
---|---|---|---|
list[i64] | i64 | i64 | i64 |
[0, 1] | 100 | 0 | 1 |
[9, 8] | 200 | 9 | 8 |
We can also use .list.eval to perform operations within each list. When using .list.eval, we need to pass an expression that contains polars.element to represent each individual element within each list.
from polars import element
df.with_columns(col('lst').list.eval(element() + 9000))
lst | val |
---|---|
list[i64] | i64 |
[9000, 9001] | 100 |
[9009, 9008] | 200 |
from itertools import islice

list_accessor = col('').list
list_accessor_attributes = (
    attr for attr in dir(list_accessor)
    if callable(getattr(list_accessor, attr)) and not attr.startswith('__')
)

print(
    *islice(list_accessor_attributes, 0, 5),
    '…',
)
all any arg_max arg_min concat …
Of course, you can also visit the Python Polars List API Reference for an exhaustive list.
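As a quick hedged sketch of a couple of these methods (note that older Polars versions spell .list.len as .list.lengths):

df.with_columns(
    total=col('lst').list.sum(),  # sum of each sublist
    n=col('lst').list.len(),      # number of elements in each sublist
)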
The Problem at Hand#
Using the logic above, one may expect to be able to pass a column into an expression that uses polars.element for elementwise computation within a list. The user in this problem wanted to add the value in 'val' to each of the nested values in 'lst'. Here are the expected input and output:
display(
    'Input',
    df,

    'Output',
    DataFrame({
        'lst': [[100, 101], [209, 208]],
        'val': [ 100, 200 ],
    })
)
'Input'
lst | val |
---|---|
list[i64] | i64 |
[0, 1] | 100 |
[9, 8] | 200 |
'Output'
lst | val |
---|---|
list[i64] | i64 |
[100, 101] | 100 |
[209, 208] | 200 |
So far, the intuitive approach in Polars has solved many of my problems; however, it falls short here:
with short_traceback():
    df.with_columns(
        col('lst').list.eval(element() + col('val'))
    )
Traceback (most recent call last):
File "/tmp/ipykernel_909870/832244687.py", line 8, in short_traceback
yield
polars.exceptions.ComputeError: named columns are not allowed in `list.eval`; consider using `element` or `col("")`
Seems like we’ll need to search for another solution here. Thankfully, we have many to choose from!
Solution(s)#
Here, I have taken the suggested answers from the Stack Overflow question and provided them as possible solutions.
Alignment via explode#
From a mechanical perspective (and the approach we would likely take in pandas), we can .explode our DataFrame. This allows us to easily align each value in 'lst' against the corresponding value from 'val' to perform our operation. In this approach, we assign a row index to track the original groupings of values. Then, we explode our data, perform our computation, and use a .group_by to recreate our original List data:
(
    df.with_row_index()
    .explode('lst')
    .group_by('index', maintain_order=True).agg(  # maintain_order keeps the original row order
        lst=col('lst') + col('val'),
        val=col('val').first(),
    )
    .drop('index')
)
lst | val |
---|---|
list[i64] | i64 |
[100, 101] | 100 |
[209, 208] | 200 |
In a similar vein, we may also perform the .explode directly on the 'lst' column and use an .over to perform our alignment. .implode is used to transform our results back into lists that align with our original DataFrame.
(
    df.with_row_index()
    .with_columns(
        (col('lst').explode() + col('val')).implode().over('index')
    )
    .drop('index')
)
lst | val |
---|---|
list[i64] | i64 |
[100, 101] | 100 |
[209, 208] | 200 |
Alignment via programmatic List operations#
Instead of exploding our data to align our nested column against our non-nested column, we can use the previously mentioned .list.get(…) method to pull out each of our values. However, for this approach to work, we NEED to know the maximum number of values in a given List. If you can make this assumption, then the solution arrives quite naturally: we evaluate the two expressions:
col('lst').list.get(0) + col('val') # first element in each list + val
col('lst').list.get(1) + col('val') # second element in each list + val
The above computations will be carried out in parallel. However, the small downside
is that we then need to concat_list
these results back into their original List
data structures, which can be an expensive operation.
from polars import concat_list

(
    df.with_columns(
        concat_list(  # one .list.get per position, over the known list length range(2)
            col('lst').list.get(i) + col('val') for i in range(2)
        )
    )
)
lst | val |
---|---|
list[i64] | i64 |
[100, 101] | 100 |
[209, 208] | 200 |
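If the maximum list length is not known ahead of time, one option is to compute it first, at the cost of an extra pass over the data. A sketch, assuming all lists share the same length (ragged lists would additionally need out-of-bounds handling in .list.get):

# find the longest sublist, then generate one expression per position
max_len = df.select(col('lst').list.len().max()).item()

df.with_columns(
    concat_list(
        col('lst').list.get(i) + col('val') for i in range(max_len)
    )
)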
Align dtypes (list)#
With the above reshaping/alignment approaches in mind, I wanted to explore whether Polars supports these operations once both columns share the exact same datatype.
with short_traceback():
    (
        df.with_columns(col('lst') + col('val'))
    )
Traceback (most recent call last):
File "/tmp/ipykernel_909870/832244687.py", line 8, in short_traceback
yield
polars.exceptions.SchemaError: failed to determine supertype of list[i64] and i64
What if I coerce 'val' to have the same datatype as 'lst'?
with short_traceback():
    (
        df.cast({'val': df.schema['lst']})
        .with_columns(col('lst') + col('val'))
    )
Traceback (most recent call last):
File "/tmp/ipykernel_909870/832244687.py", line 8, in short_traceback
yield
polars.exceptions.InvalidOperationError: `add` operation not supported for dtype `list[i64]`
It seems like neither of the above approaches worked. To make sure, I also tested these two approaches using the polars.Array datatype and achieved the same results. Broadcasting over the Array datatype does seem like an important feature, especially for all of the Python-Polars users who are migrating from tools like NumPy.
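For reference, here is a sketch of that Array test (note that the pl.Array constructor’s signature has shifted between versions; Array(Int64, 2) is one spelling):

from polars import Array, Int64

with short_traceback():
    (
        df.cast({'lst': Array(Int64, 2)})  # fixed-width arrays of two elements
        .with_columns(col('lst') + col('val'))
    )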
Align dtypes (struct)#
Interestingly enough, there IS a nested datatype that broadcasts columnar operations: polars.Struct! All we need to do is convert both our 'lst' and 'val' columns to structs, and we can carry out the operation as we expected to.
from polars import struct

(
    df.with_columns(
        (col('lst').list.to_struct() + struct(col('val')))
    )
)
lst | val |
---|---|
struct[2] | i64 |
{100,101} | 100 |
{209,208} | 200 |
The only caveat is that the returned datatype is also a struct, and, if we want our List datatype back, we’ll need to apply another conversion.
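Here is a sketch of that round trip, rebuilding the List from the struct’s fields (assuming the default field names, 'field_0' and 'field_1', that .list.to_struct produces):

(
    df.with_columns(
        (col('lst').list.to_struct() + struct(col('val')))
    )
    .with_columns(
        concat_list(  # reassemble the struct fields back into a list
            col('lst').struct.field(f'field_{i}') for i in range(2)
        )
    )
)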
DuckDB (struct broadcasting)#
The final approach relies on performing the computation in DuckDB instead of Polars, leveraging the shared memory interface to perform a zero-copy conversion. We can use our polars.DataFrame to execute a query in DuckDB and finally return a polars.DataFrame without a TON of overhead (but, yes, there will be some overhead to pay).
import duckdb
from polars import DataFrame, col, struct, concat_list

df = DataFrame({
    'lst': [[0, 1], [9, 8]],
    'val': [ 100, 200 ],
})

duckdb.sql("""
    select
        (select list(l.e + a.val) from unnest(lst) as l(e)) as lst,
        a.val,
    from df as a
""").pl()
lst | val |
---|---|
list[i64] | i64 |
[100, 101] | 100 |
[209, 208] | 200 |
Wrap-Up#
And there you have it: five different ways to solve a seemingly innocuous problem. I am very excited for the continued development of Polars to round out its API. I feel nested elementwise operations should be allowed to access other columns, and I hope this feature lands at some point. Other than that, coupling Polars and DuckDB is incredibly powerful, and I will continue to explore this pairing in the future.
What do you think about my approach? Let us know on the DUTC Discord server.
Talk to you all next time!