More NaN vs Null differences in pandas & Polars
Hello, everyone! Welcome back to Cameron's Corner! Two weeks ago, I shared a blog post that explored the differences between NaN and Null values. This week, I want to apply what was covered in that discussion to our favorite DataFrame libraries: pandas and Polars.
from traceback import print_exc
from contextlib import contextmanager
from sys import stdout
@contextmanager
def short_traceback(limit=1):
    try:
        yield
    except Exception:
        print_exc(limit=limit, file=stdout)
NaN vs NA in pandas
Historically speaking, pandas was built solely on top of NumPy. Therefore, it has a strong level of compatibility with numpy.nan, which I will refer to as NaN. The implementation of numpy.nan follows the IEEE-754 standard very closely and exhibits all of the properties you would expect: it is technically a 64-bit floating-point value, and it is not equal to itself.
from numpy import nan
print(
    f'{type(nan) = }',
    f'{nan == nan = }',
    sep='\n',
)
type(nan) = <class 'float'>
nan == nan = False
However, pandas' reliance on NumPy introduced some rough edges for end users who did not want their data coerced to float64 whenever a NaN was introduced into the values.
import pandas as pd
with short_traceback():
    s = pd.Series([nan, 1, 2], dtype='int64')
Traceback (most recent call last):
  File "/tmp/ipykernel_505185/463154796.py", line 8, in short_traceback
    yield
ValueError: cannot convert float NaN to integer
This showed a gap in how pandas used the NaN value: it could not be used reliably across datatypes.
One workaround, as we see in the handling of datetime values in pandas, was to introduce a new NaN equivalent that is unique to a given datatype: NaT (Not-a-Time).
pd.Series([nan, 1, 2], dtype='datetime64[ns]')
0 NaT
1 1970-01-01 00:00:00.000000001
2 1970-01-01 00:00:00.000000002
dtype: datetime64[ns]
However, we should keep in mind that pandas supports more than a dozen basic and extension datatypes, without even considering their deeper parameterizations:
integer
unsigned integer
float
complex
boolean
object
datetime64[ns]
datetime64[ns, tz]
timedelta64[ns]
category
interval
period
sparse
Given the above, we can see how much code writing and rewriting would be involved if we had to create a NaN-like value unique to each of these datatypes. So, in pandas version 1.0.0, the core development team released pandas.NA to solve this problem. However, in order to maintain backwards compatibility with the behaviors pandas had established prior to this release, a design decision was made: only a subset of datatypes fully supports the new pandas.NA object. Specifically, any datatype whose name begins with a capital letter (e.g., Float64 or Int64) will use pandas.NA as its NaN-like value, whereas a datatype whose name begins with a lowercase letter (e.g., float64 or int64) remains much more closely tied to NumPy's implementation.
These "capital letter" datatypes are what pandas refers to formally as "nullable" datatypes.
pandas NA
pandas.NA behaves more like a Null value than a NaN value: it supports more datatypes than just floating-point values, and checking its equivalency against itself simply returns another Null result, whereas NaN-like behavior would return False (since NaN != NaN, according to IEEE-754). At the same time, pandas wanted to be able to use the same NaN-aware functions (like Series.dropna, Series.fillna, etc.) with this new value. As a result, we end up with a hybrid behavior where pandas.NA behaves like both a NaN and a Null value.
import pandas as pd
df = (
    pd.Series(
        [0, nan, 2, pd.NA],
        dtype='Float64',  # Float64 is a pandas nullable datatype (uses pd.NA)
        name='data',
    )
    .to_frame()
    .assign(
        is_na=lambda d: d['data'].isna(),            # isna() does not differentiate NaN vs NA
        as_int=lambda d: d['data'].astype('Int64'),  # Int64 is another pandas nullable dtype
        eq=lambda d: d['data'] == d['data'],
    )
)
df
|   | data | is_na | as_int | eq   |
|---|------|-------|--------|------|
| 0 | 0.0  | False | 0      | True |
| 1 | <NA> | True  | <NA>   | <NA> |
| 2 | 2.0  | False | 2      | True |
| 3 | <NA> | True  | <NA>   | <NA> |
Above, we see a few things:
When converting to the nullable Float64 datatype, pandas will coerce any NaN values (including float('nan') and numpy.nan) to its internal pandas.NA value (see the quick check below).
pandas.NA values can be used in either the Float64 or Int64 datatype.
pandas.NA returns pandas.NA when it is compared with itself.
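We can double-check that first point directly: constructing a nullable array from either flavor of NaN (or from None) lands us on pandas.NA. The pd.array call below is my own quick check, not part of the original example:

# every missing-like input is coerced to <NA> when building a nullable Float64 array
print(pd.array([float('nan'), nan, None], dtype='Float64'))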
However, if we naturally arrive at a NaN value (e.g., by dividing 0/0 or some similar operation), we can see that there are still rough edges to be handled here. This also showcases that the NA support is implemented as a layer on top of NumPy arrays; we haven't fully changed the array backend here.
result = df['data'] / df['data']
display(
    result,                   # remember 0 / 0 → NaN
    f'{type(result[0]) = }',  # the NaN is a numpy.nan!
)
0 NaN
1 <NA>
2 1.0
3 <NA>
Name: data, dtype: Float64
"type(result[0]) = <class 'numpy.float64'>"
When we have a scenario where both NaN and NA values are present (which can only happen in Float64 arrays), we see that pandas only considers the NA values, and not the NaN values, to be missing.
result.dropna() # ???
0 NaN
2 1.0
Name: data, dtype: Float64
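The reason for this behavior is that, for the nullable (masked) datatypes, missing-value checks are driven by the NA mask, so a computed NaN is not flagged as missing. A quick check of that assumption (based on the behavior of recent pandas versions; not part of the original example):

result.isna()   # flags only the <NA> rows; the computed NaN in row 0 is not considered missing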
While it is rare that you would find yourself in this corner of pandas (we have intentionally created an interesting edge case), it is important to know how these values are considered internally. pandas.NA is not a full answer to the missing distinction between NaN and Null, but it is one that works for the majority of pandas users and use cases. Let's not forget that pandas has been the leading Python DataFrame tool for over a decade!
With that said, the core developers are aware of these caveats, and there has been an ongoing discussion among them (since 2020, not long after the release of pandas 1.0 and the introduction of pandas.NA) on how to move forward from here. Currently, favor seems to lie with pandas enhancement proposal 16 (PDEP-16), which explores this topic in great depth. The primary concern with enacting these types of changes is backwards compatibility; let's not forget that pandas has many millions of users, so change is going to be slow and thoughtful (special thanks to @Marco Gorelli for pointing me towards these discussions).
NaN vs Null in Polars
This week I wanted to finish the discussion of NaN and Null in DataFrame tools, so let's wrap things up with Polars!
Since Polars is a much newer tool, there is less historical discussion to be had here. In my opinion, Polars made a lot of good decisions that better enable the interplay between NaN and Null values while preserving their unique semantic meanings and most of their respective historical behaviors.
When working with Polars and considering NaN/Nulls, the four most important points to keep in mind are:
NaN values are restricted to the Float64 datatype, while Null values can appear in ANY datatype.
Algebraic operations on NaN values result in a NaN value.
Comparisons against NaN values are deterministic: NaN == NaN is true, and NaN compares greater than every other value.
Any operation that is executed on a Null value (aside from .is_null()) results in a Null value.
Below is a simple example demonstrating how Polars handles the detection of both NaN and Null values. Notice how we use two separate methods, .is_nan() and .is_null(), to identify these cases.
import polars as pl
df = (
    pl.Series('data', [0, float('nan'), 2, None], dtype=pl.Float64).to_frame()
    .with_columns(
        is_null=pl.col('data').is_null(),
        is_na=pl.col('data').is_nan(),  # note that Null values are still null here!
    )
)
df
| data | is_null | is_na |
|------|---------|-------|
| f64  | bool    | bool  |
| 0.0  | false   | false |
| NaN  | false   | true  |
| 2.0  | false   | false |
| null | true    | null  |
Next, let's see how algebraic operations propagate NaN values, aligning with the behavior expected under IEEE-754 standards. Operations performed on NaN values continue to yield NaN, ensuring consistency in numerical computations.
df.select(  # algebraic operations return NaN values, in line with IEEE-754
    pl.col('data'),
    add=pl.col('data') + 1,
    sub=pl.col('data') - 1,
    mul=pl.col('data') * 1,
)
| data | add  | sub  | mul  |
|------|------|------|------|
| f64  | f64  | f64  | f64  |
| 0.0  | 1.0  | -1.0 | 0.0  |
| NaN  | NaN  | NaN  | NaN  |
| 2.0  | 3.0  | 1.0  | 2.0  |
| null | null | null | null |
As you can see, regardless of the arithmetic operation performed, both the NaN and Null values remain preserved. But what happens when we use comparators instead of algebraic operations? Here we start to see some of the behavioral differences between these two entities!
df.select(  # notice that the NaN results are in line with the behavior of the other numeric values!
    pl.col('data'),
    eq=pl.col('data') == pl.col('data'),
    lt=pl.col('data') < pl.col('data'),
    gt=pl.col('data') > pl.col('data'),
)
| data | eq   | lt    | gt    |
|------|------|-------|-------|
| f64  | bool | bool  | bool  |
| 0.0  | true | false | false |
| NaN  | true | false | false |
| 2.0  | true | false | false |
| null | null | null  | null  |
Polars NaN Behavior Deviations
While Polars adheres to many aspects of IEEE-754, it also intentionally deviates in some key areas to facilitate more efficient data processing. This is especially true when it comes to how NaN values are treated during comparisons and sorting. These deviations are designed to eliminate the need for additional branching in performance-critical code.
Below is a comparison table that highlights the primary differences between the IEEE-754 NaN behavior and how Polars handles NaN values:
| Aspect | IEEE-754 NaN Behavior | Polars Behavior |
|--------|-----------------------|-----------------|
| Comparison Operations | Comparisons (<, >, etc.) with NaN always return false (unordered). | NaN is treated as greater than all other numbers. |
| Equality Check | NaN is not equal to any value, including itself (i.e., NaN != NaN). | NaN is equal to itself! NaN == NaN |
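Both rows of that table can be observed directly; below is a small sketch of my own (a fresh Series, separate from the df above) showing the sorting and equality behavior:

s = pl.Series('vals', [2.0, float('nan'), 0.0, None])

print(s.sort())           # NaN sorts after every other number (treated as the greatest value)
print(s == float('nan'))  # NaN == NaN evaluates to true; the null row stays null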
These deviations from the IEEE-754 NaN behaviors likely exist to enhance the performance of the underlying operations. In particular, equality checking and sorting would otherwise require extra branching logic to correctly handle NaN values.
To further illustrate the branching challenges, consider the following Python example that simulates how a traditional comparison function might operate under strict IEEE-754 rules:
def compare(x, y):
    if x < y:
        return -1
    elif x > y:
        return 1
    elif x == y:
        return 0
    # this exhausts our comparison for all typical numbers!
    raise ValueError('How did you get here?')
inf = float('inf')
print(
    f'{compare(1, 0) = }',
    f'{compare(0, 1) = }',
    f'{compare(1, 1) = }',
    f'{compare(inf , 1) = }',
    f'{compare(-inf, 1) = }',
    sep='\n',
)
compare(1, 0) = 1
compare(0, 1) = -1
compare(1, 1) = 0
compare(inf , 1) = 1
compare(-inf, 1) = -1
In this example, the function compares ordinary numbers well. However, if we try to compare NaN values, the IEEE-754 standard would require extra checks to properly manage unordered comparisons. Let's see what happens when we try to pass a NaN into our comparison function:
compare(float('nan'), 0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[12], line 1
----> 1 compare(float('nan'), 0)
Cell In[11], line 9, in compare(x, y)
7 return 0
8 # this exhausts our comparison for all typical numbers!
----> 9 raise ValueError('How did you get here?')
ValueError: How did you get here?
As expected, the function does not know how to handle a NaN under strict IEEE-754 semantics, which would typically require additional branching logic. Instead, Polars chooses to make NaN values comparable by default. This decision reduces the need for extra logic and branching, thus keeping computations efficient and straightforward.
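To contrast, here is a sketch of what a total ordering in the spirit of Polars' choice could look like in plain Python (my own illustration, not Polars' actual Rust implementation): NaN compares equal to itself and greater than every other number, so the function never falls through to an error branch.

from math import isnan

def total_compare(x, y):
    # treat NaN as equal to itself and greater than every other number
    x_nan, y_nan = isnan(x), isnan(y)
    if x_nan and y_nan:
        return 0
    if x_nan:
        return 1
    if y_nan:
        return -1
    return (x > y) - (x < y)

print(
    f'{total_compare(float("nan"), 0) = }',
    f'{total_compare(float("nan"), float("nan")) = }',
    sep='\n',
)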
Wrap-Up
In conclusion, while IEEE-754 defines NaN as unordered and not equal to any value (even itself), Polars' deviation from these strict rules offers practical benefits. The consistent handling of NaN values simplifies many operations, reduces the number of conditional checks, and leads to overall better performance in data analytics workflows.
As a final thought, I should mention that Polars isn't the only tabular data analysis tool that exhibits these NaN behaviors...
import duckdb
duckdb.sql('''
    SELECT
        data
        , data == data
        , data > 0
        , data + 1
    FROM df
''')
┌────────┬───────────────────┬──────────────┬──────────────┐
│ data │ ("data" = "data") │ ("data" > 0) │ ("data" + 1) │
│ double │ boolean │ boolean │ double │
├────────┼───────────────────┼──────────────┼──────────────┤
│ 0.0 │ true │ false │ 1.0 │
│ nan │ true │ true │ nan │
│ 2.0 │ true │ true │ 3.0 │
│ NULL │ NULL │ NULL │ NULL │
└────────┴───────────────────┴──────────────┴──────────────┘
But I'll save that discussion for later. What do you think about NaN and Null? Have you run into issues with them while using Polars? Let me know on the DUTC Discord server.
Talk to you all then!