Why is DataFrame.corr() so much slower than numpy.corrcoef?#

Hey all! This week, I encountered a question that reminded me of our upcoming Performance seminar series.

I responded to this question on StackOverflow, in which the author noted that calling pandas.DataFrame.corr() was much slower than calling numpy.corrcoef, with the following result:

%timeit np.corrcoef(df.values)
5.17 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit df.T.corr()
8min 49s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Seeing this result, they reached a reasonable conclusion: pandas clearly is not using the same correlation function as NumPy, given the huge speed difference between the two.

Why might this be?

pandas Abstractions and Conveniences#

For the most part, pandas is built on top of NumPy, as PyArrow hasn’t fully replaced all of the NumPy-based mechanics at the time of writing!

This means that in many cases, pandas can simply call NumPy functions to do the heavy lifting for its operations.

Profiling Code in a Notebook

The Jupyter magic %prun profiles the line of code being executed, allowing us to see all of the underlying function calls being made. The -l flag applies a filter to show only a subset of those calls.

from pandas import Series

s = Series([1,2,3])
%prun -l numpy s.sum()
#          117 function calls in 0.000 seconds

#    Ordered by: internal time
#    List reduced from 57 to 4 due to restriction <'numpy'>

#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
#         1    0.000    0.000    0.000    0.000 {method 'sum' of 'numpy.ndarray' objects}
#         4    0.000    0.000    0.000    0.000 {built-in method numpy.seterrobj}
#         8    0.000    0.000    0.000    0.000 {built-in method numpy.geterrobj}
from numpy import array

x = array([1,2,3])
%prun -l numpy x.sum()
#          116 function calls in 0.001 seconds

#    Ordered by: internal time
#    List reduced from 56 to 4 due to restriction <'numpy'>

#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
#         4    0.000    0.000    0.000    0.000 {built-in method numpy.seterrobj}
#         1    0.000    0.000    0.000    0.000 {method 'sum' of 'numpy.ndarray' objects}
#         8    0.000    0.000    0.000    0.000 {built-in method numpy.geterrobj}

From the above example, we can see that pandas.Series.sum and numpy.ndarray.sum end up calling similar code paths. Given that we can use profiling to investigate what our abstractions (Series and array) are doing with our underlying data, let’s apply this to the correlation question at hand.

Investigating Performance#

To investigate this problem, we’ll need to come up with a dataset: something large enough that we can see marked performance differences, but not so large that we need to wait a very long time for the function calls to return. I settled on a DataFrame/array of floating point values with a shape of (100,000 x 20).

from pandas import DataFrame
from numpy import nan, corrcoef
from numpy.random import default_rng


rng = default_rng(0)
df = DataFrame(rng.normal(0, 3, size=(100_000, 20)))

df.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0.377191 -0.396315 1.921268 0.314700 -1.607008 1.084785 3.912000 2.841243 -2.111206 -3.796264 -1.869823 0.123978 -6.975092 -0.656375 -3.737733 -2.196802 -1.632777 -0.948900 1.234892 3.127540
1 -0.385604 4.099390 -1.995584 1.054530 2.710411 0.282037 -2.230498 -2.765176 -1.373177 0.660585 -3.028855 -0.627527 -0.477675 1.622537 0.643977 1.066118 -1.961486 -0.388841 2.351926 4.480293
2 -3.777197 4.541771 4.037626 2.343934 0.793367 -0.941768 4.374062 5.880775 5.404905 3.945311 1.072141 -3.624956 -0.013362 1.969425 -3.865084 1.185366 1.289591 2.088128 -3.552354 -1.985108
3 -1.309306 -3.509406 5.218104 -1.487732 0.986909 -0.775718 4.750419 3.961083 1.900058 -6.610530 0.156087 2.051059 3.011885 -1.853721 5.466034 -3.961293 -1.984584 2.805150 0.147164 6.007178
4 0.565558 -1.899582 -1.132691 -3.273438 -3.833040 1.891234 1.743497 3.883676 -2.263817 5.067322 -0.862163 4.723225 -1.298358 -2.206450 0.749356 3.094359 0.483029 -1.756586 -4.023659 -4.204561

The first thing we want to verify is that these two functions produce the same results. Remember that DataFrame.corr() treats each column as a variable, whereas numpy.corrcoef, by default, treats each row as a variable. So we’ll need to account for this difference by either transposing our numpy.ndarray or by passing rowvar=False when calling numpy.corrcoef.

from numpy import allclose

assert allclose(df.corr(), corrcoef(df.to_numpy(), rowvar=False))

Seems that our outputs match up! Let’s take a slightly deeper dive by profiling the code we ran.

%prun -l .1 df.corr()
#          175 function calls (169 primitive calls) in 0.121 seconds

#    Ordered by: internal time
#    List reduced from 91 to 9 due to restriction <0.1>

#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    0.095    0.095    0.095    0.095 {pandas._libs.algos.nancorr}
#         1    0.023    0.023    0.023    0.023 {method 'copy' of 'numpy.ndarray' objects}
#         1    0.002    0.002    0.002    0.002 missing.py:268(_isna_array)
#         1    0.000    0.000    0.026    0.026 managers.py:1721(as_array)
#         1    0.000    0.000    0.121    0.121 frame.py:10228(corr)
#         2    0.000    0.000    0.000    0.000 frame.py:609(__init__)
#         1    0.000    0.000    0.121    0.121 {built-in method builtins.exec}
#        39    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
#         1    0.000    0.000    0.002    0.002 missing.py:191(_isna)
from numpy import corrcoef
arr = df.to_numpy().T

%prun -l .1 corrcoef(arr)
#          94 function calls (83 primitive calls) in 0.011 seconds

#    Ordered by: internal time
#    List reduced from 64 to 6 due to restriction <0.1>

#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#      12/1    0.004    0.000    0.011    0.011 {built-in method numpy.core._multiarray_umath.implement_array_function}
#         1    0.003    0.003    0.011    0.011 function_base.py:2462(cov)
#         1    0.002    0.002    0.002    0.002 {method 'reduce' of 'numpy.ufunc' objects}
#         2    0.001    0.001    0.001    0.001 {built-in method numpy.array}
#         1    0.000    0.000    0.011    0.011 function_base.py:2689(corrcoef)
#         1    0.000    0.000    0.003    0.003 _methods.py:162(_mean)
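On the NumPy side, the profile above shows nearly all of the time spent inside cov in function_base.py: corrcoef is essentially a thin wrapper that computes one vectorized covariance matrix for all variables at once and then normalizes it. Here is a minimal sketch of that idea (corr_via_cov is my own name for it, and the real implementation additionally clips the result into [-1, 1]):

from numpy import cov, diag, outer, sqrt

def corr_via_cov(mat):
    # one vectorized covariance computation covers every pair of variables
    c = cov(mat, rowvar=False)
    # normalize each covariance by the product of the standard deviations
    d = sqrt(diag(c))
    return c / outer(d, d)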

Clearly, pandas is taking a very different approach than NumPy. Specifically, pandas spends nearly all of its time in pandas._libs.algos.nancorr, which is where we might reason the core computation is taking place.

Right away, given the name of the function, nancorr, we can begin to suspect that there is going to be a difference in how numpy.corrcoef and pandas.DataFrame.corr handle nan values. The above test verified that the outputs of these functions were the same for the provided input, so let’s introduce some nan values to see whether this still holds.

df.loc[3, 2] = nan

assert allclose(df.corr(), corrcoef(df.to_numpy().T))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [12], in <cell line: 3>()
      1 df.loc[3, 2] = nan
----> 3 assert allclose(df.corr(), corrcoef(df.to_numpy().T))

AssertionError: 
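As a quick sanity check (this step is my own addition, not from the original question): if we drop the row containing the nan before comparing, the two functions agree once again, which points squarely at nan handling as the source of the difference.

clean = df.dropna()

assert allclose(clean.corr(), corrcoef(clean.to_numpy(), rowvar=False))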

Seems like we’ve found the difference! Most of us would stop here and simply chalk these differences up to nan handling. But let’s dive just a little bit deeper and see where nan values appear in the output now that we’ve changed the raw data.

df.corr().round(2).where(lambda d: d.isna(), '')
     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
DataFrame(corrcoef(df, rowvar=False)).round(2).where(lambda d: d.isna(), '')
     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
0            NaN
1            NaN
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3            NaN
4            NaN
5            NaN
6            NaN
7            NaN
8            NaN
9            NaN
10           NaN
11           NaN
12           NaN
13           NaN
14           NaN
15           NaN
16           NaN
17           NaN
18           NaN
19           NaN

It seems the NumPy implementation produces nans across the entire row and column where we introduced the nan value, whereas pandas drops nan values from each pairwise combination of columns. In fact, if you take a look at the implementation in pandas, it does exactly this: https://github.com/pandas-dev/pandas/blob/3e913c27faeb76cf3c6f01e688a2eeee3762980f/pandas/_libs/algos.pyx#L343-L394. It performs online calculations of the descriptive statistics necessary to estimate the Pearson r coefficient, skipping any observation where either value in the pair is nan.
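To make that pairwise behavior concrete, here is a minimal NumPy sketch of the same idea (pairwise_nan_corr is my own name for it, and the real nancorr is a single-pass Cython routine that accumulates running sums rather than calling corrcoef per pair). For our example DataFrame, it should reproduce df.corr():

from numpy import corrcoef, empty, isnan, nan

def pairwise_nan_corr(mat):
    # mat: 2d array with observations in rows and variables in columns
    n_cols = mat.shape[1]
    out = empty((n_cols, n_cols))
    for i in range(n_cols):
        for j in range(i, n_cols):
            # drop only the rows where either of these two columns is nan,
            # instead of letting a single nan poison the entire result
            mask = ~(isnan(mat[:, i]) | isnan(mat[:, j]))
            if mask.sum() > 1:
                out[i, j] = out[j, i] = corrcoef(mat[mask, i], mat[mask, j])[0, 1]
            else:
                out[i, j] = out[j, i] = nan
    return out

assert allclose(df.corr(), pairwise_nan_corr(df.to_numpy()))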

For some more notes, check out my answer on StackOverflow, where I go into more depth on the pandas implementation.

Wrap Up#

This takes us to the end of this week’s blog post! I hope you all enjoyed it and learned a thing or two about profiling your code to guide your investigations of performance trade-offs. The answer is always in the source code!

If you want to learn more about how to get the best performance out of your code, check out our upcoming Performance seminar series. You won’t want to miss it!

Talk to you all next week.