Why is DataFrame.corr() so much slower than numpy.corrcoef?#
Hey all! This week, I encountered a question that reminded me of our upcoming Performance seminar series.
I responded to this question on StackOverflow, in which the author noted that calling pandas.DataFrame.corr() was much slower than calling numpy.corrcoef, with the following result:
%timeit np.corrcoef(df.values)
5.17 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit df.T.corr()
8min 49s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Seeing this result, they reached the reasonable conclusion that pandas is clearly not using the same correlation function as NumPy, given the huge speed difference between the two.
Why might this be?
pandas Abstractions and Conveniences#
For the most part, pandas is built on top of NumPy; at the time of writing, PyArrow has not yet fully replaced the NumPy-based mechanics. This means that in many cases, pandas can simply call NumPy functions to do the heavy lifting for its operations.
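One quick way to see this layering is to pull the underlying array out of a Series (a minimal sketch; whether to_numpy() returns a view or a copy depends on the dtype and backing storage, so don't rely on memory sharing in general):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Peel back the pandas layer to reveal the NumPy array underneath.
arr = s.to_numpy()
print(type(arr))             # <class 'numpy.ndarray'>

# The same reduction is available one layer down.
print(arr.sum() == s.sum())  # True
```

Operations like Series.sum ultimately dispatch to this array, which is why profiling a pandas call turns up NumPy internals.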
Profiling Code in a Notebook
In the Jupyter magic %prun profiles below, profiling the line of code being executed lets us see all of the underlying function calls being made. The -l flag applies a filter to show only a subset of those calls.
from pandas import Series
s = Series([1,2,3])
%prun -l numpy s.sum()
# 117 function calls in 0.000 seconds
# Ordered by: internal time
# List reduced from 57 to 4 due to restriction <'numpy'>
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
# 1 0.000 0.000 0.000 0.000 {method 'sum' of 'numpy.ndarray' objects}
# 4 0.000 0.000 0.000 0.000 {built-in method numpy.seterrobj}
# 8 0.000 0.000 0.000 0.000 {built-in method numpy.geterrobj}
from numpy import array
x = array([1,2,3])
%prun -l numpy x.sum()
# 116 function calls in 0.001 seconds
# Ordered by: internal time
# List reduced from 56 to 4 due to restriction <'numpy'>
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
# 4 0.000 0.000 0.000 0.000 {built-in method numpy.seterrobj}
# 1 0.000 0.000 0.000 0.000 {method 'sum' of 'numpy.ndarray' objects}
# 8 0.000 0.000 0.000 0.000 {built-in method numpy.geterrobj}
From the above example, we can see that pandas.Series.sum and numpy.ndarray.sum end up calling similar code paths. Given that we can use profiling to investigate what our abstractions (Series and array) are doing with our underlying data, let's apply this to the correlation question we have at hand.
Investigating Performance#
To investigate this problem, we'll need to come up with a dataset: something large enough that we can see marked performance differences, but not so large that we need to wait a very long time for the function calls to return. I settled on a DataFrame/array with a shape of (100,000 x 20) of floating point values.
from pandas import DataFrame
from numpy import nan, corrcoef
from numpy.random import default_rng
rng = default_rng(0)
df = DataFrame(rng.normal(0, 3, size=(100_000, 20)))
df.iloc[:, :7].head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.377191 | -0.396315 | 1.921268 | 0.314700 | -1.607008 | 1.084785 | 3.912000 |
| 1 | -0.385604 | 4.099390 | -1.995584 | 1.054530 | 2.710411 | 0.282037 | -2.230498 |
| 2 | -3.777197 | 4.541771 | 4.037626 | 2.343934 | 0.793367 | -0.941768 | 4.374062 |
| 3 | -1.309306 | -3.509406 | 5.218104 | -1.487732 | 0.986909 | -0.775718 | 4.750419 |
| 4 | 0.565558 | -1.899582 | -1.132691 | -3.273438 | -3.833040 | 1.891234 | 1.743497 |
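With the data in hand, a rough timing comparison already shows a gap worth explaining (a sketch using the standard timeit module; absolute numbers are machine-dependent, so only the relative difference matters):

```python
from timeit import timeit

import numpy as np
import pandas as pd

# Rebuild the same (100_000, 20) frame so this snippet stands alone.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 3, size=(100_000, 20)))

# Average over a few calls to smooth out noise.
t_pd = timeit(lambda: df.corr(), number=5) / 5
t_np = timeit(lambda: np.corrcoef(df.to_numpy(), rowvar=False), number=5) / 5
print(f"pandas .corr(): {t_pd:.4f}s/call, numpy corrcoef: {t_np:.4f}s/call")
```

Even on this NaN-free data, the two calls will not take the same amount of time, which is what the profiles below will help explain.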
The first thing we want to verify is that these two functions produce the same results. Remember that, for DataFrames, the major axis is the columns, whereas in a 2d ndarray, the major axis is the rows. So, we'll need to account for this difference in shape either by transposing our numpy.ndarray or by specifying rowvar=False when calling numpy.corrcoef.
from numpy import allclose
assert allclose(df.corr(), corrcoef(df.to_numpy(), rowvar=False))
Seems that our outputs match up! Let’s take a slightly deeper dive by profiling the code we ran.
%prun -l .1 df.corr()
# 175 function calls (169 primitive calls) in 0.121 seconds
# Ordered by: internal time
# List reduced from 91 to 9 due to restriction <0.1>
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.095 0.095 0.095 0.095 {pandas._libs.algos.nancorr}
# 1 0.023 0.023 0.023 0.023 {method 'copy' of 'numpy.ndarray' objects}
# 1 0.002 0.002 0.002 0.002 missing.py:268(_isna_array)
# 1 0.000 0.000 0.026 0.026 managers.py:1721(as_array)
# 1 0.000 0.000 0.121 0.121 frame.py:10228(corr)
# 2 0.000 0.000 0.000 0.000 frame.py:609(__init__)
# 1 0.000 0.000 0.121 0.121 {built-in method builtins.exec}
# 39 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
# 1 0.000 0.000 0.002 0.002 missing.py:191(_isna)
from numpy import corrcoef
arr = df.to_numpy().T
%prun -l .1 corrcoef(arr)
# 94 function calls (83 primitive calls) in 0.011 seconds
# Ordered by: internal time
# List reduced from 64 to 6 due to restriction <0.1>
# ncalls tottime percall cumtime percall filename:lineno(function)
# 12/1 0.004 0.000 0.011 0.011 {built-in method numpy.core._multiarray_umath.implement_array_function}
# 1 0.003 0.003 0.011 0.011 function_base.py:2462(cov)
# 1 0.002 0.002 0.002 0.002 {method 'reduce' of 'numpy.ufunc' objects}
# 2 0.001 0.001 0.001 0.001 {built-in method numpy.array}
# 1 0.000 0.000 0.011 0.011 function_base.py:2689(corrcoef)
# 1 0.000 0.000 0.003 0.003 _methods.py:162(_mean)
Clearly, pandas is taking a very different approach than NumPy. Specifically, we can see that pandas spends most of its time in pandas._libs.algos.nancorr, where we might reason that the core computation is taking place.
Right away, the name nancorr suggests that there is going to be a difference in how numpy.corrcoef and pandas.DataFrame.corr handle nan values. Given that the above test verified that the outputs of these functions were the same for the provided input, let's introduce some nan values to see if this holds.
df.loc[3, 2] = nan
assert allclose(df.corr(), corrcoef(df.to_numpy().T))
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Input In [12], in <cell line: 3>()
1 df.loc[3, 2] = nan
----> 3 assert allclose(df.corr(), corrcoef(df.to_numpy().T))
AssertionError:
Seems like we've found the difference! Most people would stop here and simply chalk these differences up to nan handling, but let's dive just a little bit deeper. Let's see where nan values appear in each output with the change we've introduced to the raw data.
df.corr().round(2).where(lambda d: d.isna(), '')
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ||||||||||||||||||||
1 | ||||||||||||||||||||
2 | ||||||||||||||||||||
3 | ||||||||||||||||||||
4 | ||||||||||||||||||||
5 | ||||||||||||||||||||
6 | ||||||||||||||||||||
7 | ||||||||||||||||||||
8 | ||||||||||||||||||||
9 | ||||||||||||||||||||
10 | ||||||||||||||||||||
11 | ||||||||||||||||||||
12 | ||||||||||||||||||||
13 | ||||||||||||||||||||
14 | ||||||||||||||||||||
15 | ||||||||||||||||||||
16 | ||||||||||||||||||||
17 | ||||||||||||||||||||
18 | ||||||||||||||||||||
19 | ||||||||||||||||||||
DataFrame(corrcoef(df, rowvar=False)).round(2).where(lambda d: d.isna(), '')
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | |||||||||||||||||||
1 | NaN | |||||||||||||||||||
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | |||||||||||||||||||
4 | NaN | |||||||||||||||||||
5 | NaN | |||||||||||||||||||
6 | NaN | |||||||||||||||||||
7 | NaN | |||||||||||||||||||
8 | NaN | |||||||||||||||||||
9 | NaN | |||||||||||||||||||
10 | NaN | |||||||||||||||||||
11 | NaN | |||||||||||||||||||
12 | NaN | |||||||||||||||||||
13 | NaN | |||||||||||||||||||
14 | NaN | |||||||||||||||||||
15 | NaN | |||||||||||||||||||
16 | NaN | |||||||||||||||||||
17 | NaN | |||||||||||||||||||
18 | NaN | |||||||||||||||||||
19 | NaN |
Seems like the NumPy implementation produces nans across the entire row and column where we introduced the nan value, whereas pandas drops nan values from each pairwise combination of columns. In fact, if you take a look at the implementation in pandas, it is doing exactly this, performing online calculations of the descriptive statistics necessary to estimate the Pearson r coefficient.
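This pairwise-deletion behavior can be sketched in plain NumPy (an illustration of the idea only, not the Cython implementation pandas actually uses): for every pair of columns, drop just the rows where either value is nan, then compute Pearson r on what remains.

```python
import numpy as np

def pairwise_corr(a):
    """Pearson correlation of the columns of `a`, dropping NaNs pairwise.

    Mirrors the *behavior* of DataFrame.corr, not its implementation:
    a NaN only excludes the rows it appears in, per column pair.
    """
    n_cols = a.shape[1]
    out = np.empty((n_cols, n_cols))
    for i in range(n_cols):
        for j in range(i, n_cols):
            x, y = a[:, i], a[:, j]
            keep = ~(np.isnan(x) | np.isnan(y))  # valid rows for this pair
            x, y = x[keep], y[keep]
            xd, yd = x - x.mean(), y - y.mean()
            r = (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
            out[i, j] = out[j, i] = r
    return out
```

With no missing data this agrees with numpy.corrcoef; once a nan is introduced, only the column pairs that actually contain it are affected, instead of the whole row and column of the output. That extra per-pair bookkeeping is also a hint at why nancorr costs more than a single matrix product over the full array.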
For more notes, check out my answer on StackOverflow, where I go into greater depth on the pandas implementation.
Wrap Up#
This takes us to the end of this week's blog post! I hope you all enjoyed it and learned a thing or two about using profiling to guide your investigations of performance trade-offs. The answer is always in the source code!
If you want to learn more about how to harness your code’s performance, check out our upcoming Performance seminar series. You won’t want to miss it!
Talk to you all next week.