The Polars Schema Inference Trap

Oh no! There's a bug in Polars!

Before anyone grabs their pitchfork, let me set the scene:

While working with Polars (v1.31.0), I stumbled onto a schema inference surprise that I wasn’t expecting. If you're using a generator to populate a DataFrame and relying on Polars to figure out the schema for you, there's a slight catch. The behavior is subtle and a little confusing, especially if you believe the documentation at face value.

Some things to note:

  1. This is not some massive Polars bug; I'm not sure how many users even create DataFrames from an iterable.

  2. Polars is a great project, and every project will have its share of bugs.

  3. I've already reported this and opened a small PR as an attempt to fix it.

According to the Polars documentation, the infer_schema_length parameter controls how many rows Polars inspects to infer the schema when the input is a sequence or generator:

infer_schema_length: int or None
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this can be slow).

But that word “may” is doing a lot of heavy lifting.

Let’s test it.

Two Cases, One Subtle Difference

# just some helpers for later
import polars as pl

from traceback import print_exc
from contextlib import contextmanager
from sys import stdout

@contextmanager
def short_traceback(limit=1):
    try:
        yield
    except Exception:
        print_exc(limit=limit, file=stdout)

def generate_data(change_schema_at, nrows=2000):
    assert nrows >= change_schema_at
    for i in range(nrows):
        if i < change_schema_at:
            yield {'a': 1, 'b': 1.0, 'c': 'xyz'}
        else:
            yield {'a': 1.1, 'b': 2.0, 'c': 'jkl'}

Now let's try building a DataFrame, telling Polars to infer the schema from the full generator:

pl.from_dicts(
    generate_data(change_schema_at=999), infer_schema_length=None
).tail(3)
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ c     │
│ --- ┆ --- ┆ ---   │
│ f64 ┆ f64 ┆ str   │
╞═════╪═════╪═══════╡
│ 1.1 ┆ 2.0 ┆ "jkl" │
│ 1.1 ┆ 2.0 ┆ "jkl" │
│ 1.1 ┆ 2.0 ┆ "jkl" │
└─────┴─────┴───────┘

This looks good. Column a correctly promotes to polars.Float64. But now change the cutoff:

pl.from_dicts(
    generate_data(change_schema_at=1_000), infer_schema_length=None
).tail(3)  # floats were silently cast to integers!
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ c     │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str   │
╞═════╪═════╪═══════╡
│ 1   ┆ 2.0 ┆ "jkl" │
│ 1   ┆ 2.0 ┆ "jkl" │
│ 1   ┆ 2.0 ┆ "jkl" │
└─────┴─────┴───────┘

Even though every row from index 1,000 onward has a float in column a, Polars infers the type as i64. That means it's not scanning the full dataset, even though we explicitly passed infer_schema_length=None.

If you're following along, you may wonder: what is so special about row 1,000? It comes from an undocumented parameter that we end users can't reach from either polars.DataFrame or polars.from_dicts. So let's talk about why this happens in the first place.

Why This Happens

The problem lies in how Polars handles chunking under the hood.

Looking at the current state of the Polars source code at the time of writing, we find that when infer_schema_length=None and adaptive_chunk_size=None (a parameter not directly exposed to the end user via polars.DataFrame or polars.from_dicts), Polars defaults to a chunk size of 1,000. So it takes the first 1,000 rows, uses that chunk to infer the schema, and assumes those types hold for the rest. If any values after that chunk would trigger a type promotion, they're silently coerced.

In this case, since the first 1000 values for a are all integers, the column becomes i64, even though floats show up later.
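To make the failure mode concrete, here is a minimal, stdlib-only sketch of first-chunk-only inference. This is a toy model, not Polars' actual code: it infers a dtype for column a from the first chunk alone, then coerces every later value to it.

```python
from itertools import islice

def first_chunk_infer(rows, chunk_size=1_000):
    """Toy model of chunked schema inference: pick the dtype for column 'a'
    from the first chunk only, then coerce all later values to that dtype."""
    rows = iter(rows)
    head = list(islice(rows, chunk_size))
    dtype = float if any(isinstance(r['a'], float) for r in head) else int
    # values past the first chunk are silently cast, mirroring the coercion above
    return [dtype(r['a']) for r in head] + [dtype(r['a']) for r in rows]

data = [{'a': 1}] * 1_000 + [{'a': 1.1}] * 5

print(first_chunk_infer(data)[-1])                    # 1   (1.1 truncated to int)
print(first_chunk_infer(data, chunk_size=1_001)[-1])  # 1.1 (float survives)
```

Once the chunk boundary crosses into the float segment, the promotion is observed; otherwise the trailing floats are quietly truncated.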

Looking at the specific constructor function linked above, we can replicate this result ourselves by calling iterable_to_pydf and passing chunk_size directly.

from polars._utils.construction.dataframe import iterable_to_pydf

d = iterable_to_pydf(
    generate_data(n_minus_1 := 100),
    infer_schema_length=None,
    chunk_size=n_minus_1,
)

d.dtypes()
[Int64, Float64, String]

from polars._utils.construction.dataframe import iterable_to_pydf

d = iterable_to_pydf(
    generate_data(n_minus_1 := 100, nrows=2000),
    infer_schema_length=None,
    chunk_size=n_minus_1 + 1, # if we set the chunk size to expand into the Float64 segment
)

# then we appropriately detect the dtypes
d.dtypes()
[Float64, Float64, String]

However, this behavior is only observed for "promotable" types (e.g. int → float). If there is a hard type mismatch, Polars instead raises when it attempts to DataFrame.vstack the chunks:

def generate_data(change_schema_at, nrows=2000):
    assert nrows >= change_schema_at
    for i in range(nrows):
        if i < change_schema_at:
            yield {'a': 1, 'b': 1.0, 'c': 'xyz'}
        else:
            # note that {'a': 1} and {'a': 'hello world'} have non-promotable
            #   types: the string 'hello world' cannot be appended to the Int64
            #   column inferred from the first chunk, so Polars raises
            yield {'a': 'hello world', 'b': 2.0, 'c': 'jkl'}

with short_traceback():
    pl.from_dicts(generate_data(change_schema_at=1000), infer_schema_length=None)
Traceback (most recent call last):
  File "/tmp/ipykernel_2168379/2847243529.py", line 9, in short_traceback
    yield
polars.exceptions.ComputeError: could not append value: "hello world" of type: str to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

A similar error can also occur when a value overflows the inferred data type's capacity.

Workarounds

If you're using pl.from_dicts(...) directly on a generator and expect to infer the schema across the full dataset, a safe option is to materialize the data first:

import polars as pl

def generate_data(change_schema_at, nrows=2000):
    assert nrows >= change_schema_at
    for i in range(nrows):
        if i < change_schema_at:
            yield {'a': 1, 'b': 1.0, 'c': 'xyz'}
        else:
            yield {'a': 1.1, 'b': 2.0, 'c': 'jkl'}

# passing a list triggers a different set of function calls that handle `infer_schema_length` 
#    differently
pl.from_dicts(
    [*generate_data(change_schema_at=1999, nrows=2000)],
    infer_schema_length=None
).tail(3)
shape: (3, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ c     │
│ --- ┆ --- ┆ ---   │
│ f64 ┆ f64 ┆ str   │
╞═════╪═════╪═══════╡
│ 1.0 ┆ 1.0 ┆ "xyz" │
│ 1.0 ┆ 1.0 ┆ "xyz" │
│ 1.1 ┆ 2.0 ┆ "jkl" │
└─────┴─────┴───────┘

What Should Change?

I would say that either:

  • Polars should scan the full iterator when infer_schema_length=None, even if chunking is in place, or

  • The documentation should call out this edge case clearly: passing None will only scan a single chunk — if your data changes shape after 1000 rows, you’re on your own.

Final Thoughts

This is one of those bugs that quietly corrupts your data: it doesn’t always throw an error, it just gets the schema wrong. And if you’re like me, you might pass infer_schema_length=None, assuming that means “play it safe.”

Perhaps this also highlights that one should always explicitly specify a schema when possible. When the data we ingest changes, we don't want silent errors; we want loud ones that notify us of the change so we can address the problem.

What are your thoughts? Let me know on the DUTC Discord server!
