NaN vs Null in pandas & Polars

Hello, everyone! Welcome back to Cameron's Corner! I have some exciting plans coming up this year, and I can't wait to show you everything in store. Before I jump into your regularly scheduled content, I want to let you know about an upcoming event I'll be hosting on YouTube on January 6th: DUTC's NASA Open Science Training - Level Up Your Open Science Skills.

This hands-on session is designed to help you enhance your Open Science skill set, collaborate with fellow researchers, and dive deeper into essential tools that empower open and reproducible science. This seminar will revolve around necessary computational skills for Open Science, with a focus on Python and reproducibility. Learn best practices for managing and collaborating on your computational research while also creating beautiful papers & project websites using MyST.

Now, let's jump in!

I was recently doing some open source work on Narwhals, the “Extremely lightweight and extensible compatibility layer between dataframe libraries!” Narwhals is essentially a library that lets you define expressions that operate on DataFrames regardless of whether the DataFrame you pass in comes from pandas, Polars, Dask, Modin, or PyArrow (to name a few). Chances are that by the time you are reading this, Narwhals will have expanded to support even more compute backends. Since it is a relatively new, DataFrame-oriented project, I decided to put some time into it and contributed a new feature that lets you detect NaN values in a column via the aptly named .is_nan() method.
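
To give a rough sense of what that looks like in practice, here is a minimal sketch of a backend-agnostic NaN check. The helper count_nans and the toy data are mine, but nw.from_native, nw.col, and .to_native are the standard Narwhals round trip:

import narwhals as nw
import pandas as pd
import polars as pl

def count_nans(native_df):
    """Count NaN values in the 'data' column, whatever library the DataFrame came from."""
    df = nw.from_native(native_df)                      # wrap the native DataFrame
    counts = df.select(nw.col('data').is_nan().sum())   # the same expression works on every backend
    return counts.to_native()                           # hand back the caller's native type

count_nans(pd.DataFrame({'data': [float('nan'), 1.0, 2.0]}))
count_nans(pl.DataFrame({'data': [float('nan'), 1.0, 2.0]}))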

Of course, no feature request is without its rough edges, and I knew that implementing appropriate support for pandas was going to be a little tricky because of its ambiguous treatment of Not-A-Number (NaN) and Null values.

Not-A-Number (NaN)

What is the difference between NaN and Null? Are these not interchangeable? As it turns out, NaN is a special value defined by the IEEE Standard for Floating-Point Arithmetic, used to signal the result of an invalid operation such as dividing 0 by 0 or taking the square root of a negative number. While that is far from an exhaustive list of the ways NaN can naturally appear in your data, the important point is that fairly typical mathematical operations can produce NaN values. Sure, you might not explicitly divide by 0, but you will probably divide one column by another, and who is to say that the value 0 will never appear in both the numerator and denominator?

In Python, we denote NaN with float("nan"), and Polars can take it from there. Let's take a look at some operations that result in NaN values:

import polars as pl
from polars import col

df = (
    pl.DataFrame({'data': [float("nan"), -1, 0, 1]})
    .with_columns(
        div_by_0=col('data') / 0, # 0 / 0    ⇒ NaN
        sqrt=col('data').sqrt(),  # sqrt(-1) ⇒ NaN
    )
)

df
shape: (4, 3)
┌──────┬──────────┬──────┐
│ data ┆ div_by_0 ┆ sqrt │
│ ---  ┆ ---      ┆ ---  │
│ f64  ┆ f64      ┆ f64  │
╞══════╪══════════╪══════╡
│ NaN  ┆ NaN      ┆ NaN  │
│ -1.0 ┆ -inf     ┆ NaN  │
│ 0.0  ┆ NaN      ┆ 0.0  │
│ 1.0  ┆ inf      ┆ 1.0  │
└──────┴──────────┴──────┘

The first thing you may notice in the above example is not just the NaN values but also the infinite (inf) values. These infinities are just like NaN in that they are special values reserved by IEEE 754 that help the standard be “...algebraically complete: every floating-point operation produces a well-defined result and will not—by default—throw a machine interrupt or trap” (IEEE 754, Design Rationale).

Another thing you may notice is that any operation involving a NaN results in a NaN value. This causes NaNs to "infect" other values, but considering the meaning we have established for NaN, this is likely desirable behavior.
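
If that propagation is not what you want, you have to opt out of it explicitly. Here is a minimal sketch using Polars' fill_nan (the replacement value 0.0 is purely illustrative):

df.select(
    raw_sum=col('sqrt').sum(),                   # NaN propagates, so the sum is NaN
    filled_sum=col('sqrt').fill_nan(0.0).sum(),  # NaN replaced first, so the sum is finite
)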

Now that we appreciate why NaN exists, we should also take note of an important detail that you may have spotted in the above example: the datatype! All of the columns have a 64-bit float (f64) datatype, and you may remember that I have mentioned multiple times that...

... NaN is a special value defined by the IEEE Standard for Floating-Point Arithmetic

This does imply that our other numeric types (namely integers) do not have any notion of NaN (or other special values like infinities).

from polars import Int64
df.cast(Int64)
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[2], line 2
      1 from polars import Int64
----> 2 df.cast(Int64)

File ~/.pyenv/versions/dutc-site-new/lib/python3.12/site-packages/polars/dataframe/frame.py:8003, in DataFrame.cast(self, dtypes, strict)
   7919 def cast(
   7920     self,
   7921     dtypes: (
   (...)
   7928     strict: bool = True,
   7929 ) -> DataFrame:
   7930     """
   7931     Cast DataFrame column(s) to the specified dtype(s).
   7932 
   (...)
   8001      'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}
   8002     """
-> 8003     return self.lazy().cast(dtypes, strict=strict).collect(_eager=True)

File ~/.pyenv/versions/dutc-site-new/lib/python3.12/site-packages/polars/lazyframe/frame.py:2043, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _check_order, _eager, **_kwargs)
   2041 # Only for testing purposes
   2042 callback = _kwargs.get("post_opt_callback", callback)
-> 2043 return wrap_df(ldf.collect(callback))

InvalidOperationError: conversion from `f64` to `i64` failed in column 'data' for 1 out of 4 values: [NaN]

Since NaN values ARE floating-point values, we can also conclude that the underlying bits of a NaN value follow the exact same structure as any other floating-point value—it simply has a bit pattern that is reserved to indicate NaN. We call this an in-band encoding, and it means that a single array can track all of your values, NaN and infinity included!
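
You can see this in-band encoding for yourself by inspecting the raw bits of a few floats with Python's standard struct module; this is just an illustration of the IEEE 754 bit layout, nothing Polars-specific:

import struct

def bits(x: float) -> str:
    """Show the 64 bits of a float as sign | exponent | mantissa."""
    (raw,) = struct.unpack('>Q', struct.pack('>d', x))
    b = f'{raw:064b}'
    return f'{b[0]} | {b[1:12]} | {b[12:]}'

print(bits(1.0))           # an ordinary float
print(bits(float('inf')))  # exponent all 1s, mantissa all 0s → infinity
print(bits(float('nan')))  # exponent all 1s, mantissa non-zero → NaN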

NaNs tl;dr

  1. NaN values represent the result of operations that are typically undefined.

  2. Operations involving NaN result in NaN, unless some special handling has been indicated.

  3. NaN values are stored in-band.

    • They ARE floating-point values, and they can only exist in arrays that have a float datatype.

Null

Null, as it is used in SQL, indicates that a value is missing or does not exist. With this definition, you can see how Null is semantically adjacent to the previously discussed NaN, in the way that the concepts of "missing" and "undefined" are adjacent. But don't be misled: just because they appear superficially related does not mean they are the same. In fact, NaN and Null are very different. In Polars, we can specify a missing value by using Python's None. Polars doesn't use Python's None internally; accepting it simply makes it easier to interoperate between Python and Polars' own computation layer.

df = pl.DataFrame({'data': [None, -1, 0, 1]})

df.with_columns(
    div_by_10=col('data') / 10
)
shape: (4, 2)
┌──────┬───────────┐
│ data ┆ div_by_10 │
│ ---  ┆ ---       │
│ i64  ┆ f64       │
╞══════╪═══════════╡
│ null ┆ null      │
│ -1   ┆ -0.1      │
│ 0    ┆ 0.0       │
│ 1    ┆ 0.1       │
└──────┴───────────┘

Just like our NaN values, Nulls seem to propagate across algebraic operations.
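
And, just as with NaN, opting out of that propagation is an explicit choice. Polars' fill_null is the Null counterpart to fill_nan (the replacement value 0 below is purely illustrative):

df.with_columns(
    # replace nulls before dividing so the result carries a value instead of a null
    filled_div=col('data').fill_null(0) / 10,
)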

You may have noticed that we can have Null values in both integer and float data types. Does this mean that Null has some special in-band encoding for these numeric data types? Let's see:

from datetime import datetime

pl.DataFrame({
    'strings' : [None, 'a', 'b'],
    'integers': [None, 1, 2],
    'floats'  : [None, 1.0, 2.0],
    'dates'   : [None, datetime(2025, 1, 1), datetime(2025, 1, 1)]
})
shape: (3, 4)
┌─────────┬──────────┬────────┬─────────────────────┐
│ strings ┆ integers ┆ floats ┆ dates               │
│ ---     ┆ ---      ┆ ---    ┆ ---                 │
│ str     ┆ i64      ┆ f64    ┆ datetime[μs]        │
╞═════════╪══════════╪════════╪═════════════════════╡
│ null    ┆ null     ┆ null   ┆ null                │
│ "a"     ┆ 1        ┆ 1.0    ┆ 2025-01-01 00:00:00 │
│ "b"     ┆ 2        ┆ 2.0    ┆ 2025-01-01 00:00:00 │
└─────────┴──────────┴────────┴─────────────────────┘

It turns out that Null values can be used in ANY datatype, which means either that there is a special reserved in-band bit pattern for each of the above data types OR that some out-of-band encoding mechanism is used to track null values. Out-of-band encoding simply means that a separate array of Boolean values indicates which entries are null and which are not.

df.with_columns(**df['data']._get_buffers())
shape: (4, 4)
┌──────┬────────┬──────────┬─────────┐
│ data ┆ values ┆ validity ┆ offsets │
│ ---  ┆ ---    ┆ ---      ┆ ---     │
│ i64  ┆ i64    ┆ bool     ┆ null    │
╞══════╪════════╪══════════╪═════════╡
│ null ┆ 0      ┆ false    ┆ null    │
│ -1   ┆ -1     ┆ true     ┆ null    │
│ 0    ┆ 0      ┆ true     ┆ null    │
│ 1    ┆ 1      ┆ true     ┆ null    │
└──────┴────────┴──────────┴─────────┘

Take a look at the "validity" column. When it is True, the "data" column contains an actual value, and when it is False, the "data" column reports a null. This "validity" array acts as a mask, indicating which values are null and which are not.

What's particularly fascinating is that if you examine the "values" column, you'll notice it contains a placeholder—such as 0—wherever there's a null in the "data" column. This design has a computational advantage: even though the placeholder exists, the actual value doesn't need to be inspected when performing calculations. The operations can remain branchless, meaning they don't need conditional logic (which would slow down computation) to check whether each value is null and should be skipped. Instead, the "validity" mask ensures that whatever was computed underneath the nulls is hidden from the user at the end of the computation.
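
To make that concrete, here is a toy "masked mean" in plain Python. This is only a conceptual sketch of the branchless idea, not Polars' actual kernel:

values   = [0, -1, 0, 1]               # the placeholder 0 sits underneath the null
validity = [False, True, True, True]   # out-of-band mask: which entries are real

# Branchless aggregation: multiply by the mask instead of testing each element with `if`.
# The placeholder still participates in the arithmetic but contributes nothing to the result.
total = sum(v * ok for v, ok in zip(values, validity))
count = sum(validity)
print(total / count)  # 0.0 — the placeholder never influenced the answer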

Keep in mind that for certain types of data, like strings, this process involves more complexity—using offsets, the validity mask, and the values array together to reconstruct the user-facing representation.
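
As a rough sketch of that idea, following the Arrow-style layout of one contiguous character buffer plus an offsets array (the buffers below are made up for illustration):

# All characters live in one buffer; offsets[i]:offsets[i + 1] slices out row i,
# and the validity mask says which rows are real values versus nulls.
buffer   = 'ab'
offsets  = [0, 0, 1, 2]           # one more offset than there are rows
validity = [False, True, True]    # row 0 is null

strings = [
    buffer[offsets[i]:offsets[i + 1]] if validity[i] else None
    for i in range(len(validity))
]
print(strings)  # [None, 'a', 'b']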

Nulls tl;dr

  1. Null represents a lack of data.

  2. Null values are stored out-of-band.

    • A column of any data type can contain null values.

So… What's the Issue?

At the beginning of this blog post, I mentioned that the differences between NaN and Null are handled inconsistently in pandas. Despite the clear differences we have highlighted between NaN and Null, pandas (before version 1.0.0) only supported NaN values and had no specific support for Null values. This was partly because pandas was originally built on top of NumPy, which has great support for working with NaN values. NumPy does also support its own version of Null values in arrays through its lesser-known masked array interface. However, NumPy's masked arrays have historically been slow, which may have led pandas to avoid them in the early days:

from numpy.random import default_rng

from numpy.ma import array as masked_array
from numpy import array, nan

rng = default_rng(0)

missing_mask = rng.integers(2, size=(size := 1_000_000)).astype('bool')
xs = rng.normal(100, 10, size=size).round(2)

xs_nan = xs.copy()
xs_nan[missing_mask] = nan

xs_mask = masked_array(xs, mask=missing_mask)

print(
    f'{xs[:6] = }',
    f'{xs_nan[:6] = }',
    f'{xs_mask[:6] = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 80)
)
xs[:6] = array([105.24,  82.19, 105.14,  97.99, 101.55, 110.25])
────────────────────────────────────────────────────────────────────────────────
xs_nan[:6] = array([   nan,    nan,    nan,  97.99, 101.55, 110.25])
────────────────────────────────────────────────────────────────────────────────
xs_mask[:6] = masked_array(data=[--, --, --, 97.99, 101.55, 110.25],
             mask=[ True,  True,  True, False, False, False],
       fill_value=1e+20)
%timeit -n 5 -r 5 xs.mean()
%timeit -n 5 -r 5 xs_nan.mean()
%timeit -n 5 -r 5 xs_mask.mean()
690 μs ± 170 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
805 μs ± 244 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
11.2 ms ± 188 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)

The upshot is that pandas opted to simply use NumPy arrays with NaN values instead of the masked array interface.
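
You can still see the consequences of that choice today: by default, introducing a missing value into an integer pandas Series promotes the whole column to float64 and stores NaN, whereas the newer, opt-in nullable dtypes keep the integer type and use pd.NA. A quick illustration:

import pandas as pd

# default (NumPy-backed) behaviour: the missing value becomes NaN,
# and the column is promoted to float64 to make room for it
print(pd.Series([1, None, 3]).dtype)                  # float64

# opt-in nullable dtype: the missing value is a true Null (pd.NA),
# and the column stays an integer type
print(pd.Series([1, None, 3], dtype='Int64').dtype)   # Int64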

But in Polars (which of course does not use NumPy as a compute backend), we can see that the performance difference between floats with NaN values and floats with Null values is less drastic.

df = pl.DataFrame({
    's':      pl.Series(xs),
    's_nan':  pl.Series(xs_nan),
    's_null': pl.Series(xs).set(pl.Series(missing_mask), None)
})

df.head()
shape: (5, 3)
┌────────┬────────┬────────┐
│ s      ┆ s_nan  ┆ s_null │
│ ---    ┆ ---    ┆ ---    │
│ f64    ┆ f64    ┆ f64    │
╞════════╪════════╪════════╡
│ 105.24 ┆ NaN    ┆ null   │
│ 82.19  ┆ NaN    ┆ null   │
│ 105.14 ┆ NaN    ┆ null   │
│ 97.99  ┆ 97.99  ┆ 97.99  │
│ 101.55 ┆ 101.55 ┆ 101.55 │
└────────┴────────┴────────┘
s, snan, snull = df['s'], df['s_nan'], df['s_null']
%timeit -n 5 -r 5 s.mean()
%timeit -n 5 -r 5 snan.mean()
%timeit -n 5 -r 5 snull.mean()
491 μs ± 154 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
602 μs ± 140 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
979 μs ± 150 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)

Wrap-Up

That’s all the time we have for today! So far, we have established both the conceptual and concrete differences between Not-A-Number and Null values. Additionally, I noted that NumPy has much more performant support for NaN values than for masked (null) values, and in a future blog post, I will further discuss the implications this has when writing pandas code.

How do you interact with NaN and null values in your work? Has this changed your perspective? Let me know on the DUTC Discord server.

Talk to you all next time!
