Pandas: SettingWithCopyWarning

Wrapping up June already?! I can't believe how quickly things are moving.

I wanted to take some time today to discuss one of the most common issues facing pandas users: SettingWithCopyWarning

This warning indicates to the user that they are assigning values to a DataFrame that originated as a subset of another DataFrame. While this may sound completely fine like a completely fine operation on the surface, the underlying view/copy behaviors in pandas leads to a sometimes ambiguous behavior where user code can have unintended side effects on the data.

Simple Mutation Assignment

from pandas import DataFrame

df = DataFrame(0, index=[*'abc'], columns=[*'xyz'])

print(df, end=(hline := '\n{}\n'.format('\N{box drawings light horizontal}' * 40)))
df.loc['b', 'y'] = 1
print(df)

   x  y  z
a  0  0  0
b  0  0  0
c  0  0  0
────────────────────────────────────────
   x  y  z
a  0  0  0
b  0  1  0
c  0  0  0

The above snippet can perform inplace assignment, which simply means that the underlying DataFrame is mutated. This does not mean that the operation is performed on a view or a copy of the original as there are many more factors at play when pandas determines whether we operate with views or copies.

Views & copies aside, we can fully assess whether we updated our data successfully by noting the change in the above snippet. Let’s take a look at another example

Subset Operation Propagation

from numpy import arange
from pandas import DataFrame

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
df_subset = df.loc[lambda d: d['x'] < 5]

print(df_subset, end=hline)
df_subset.loc['a', 'y'] = 99
print(df_subset, df, sep=hline)

   x  y  z
a  0  1  2
b  3  4  5
────────────────────────────────────────
   x   y  z
a  0  99  2
b  3   4  5
────────────────────────────────────────
   x   y   z
a  0   1   2
b  3   4   5
c  6   7   8
d  9  10  11

By performing the same assignment operation df_subset.loc['a', 'y'] = 1 as I did in the first example, I'm not eliciting the SettingWithCopyWarning. What is going on, and what is pandas trying to communicate to me?

As I hinted at before, pandas is notifying us of an ambiguity. Here, we are operating on a subset of a larger DataFrame, and pandas is pointing that it is unsure whether or not we wanted the change we made to df_subset to propagate back to the original df.

If we use a slice instead of a boolean mask to subset our dataframe, then we can get some very interesting results.

from numpy import arange
from pandas import DataFrame

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
df_subset = df.loc['a':'c']

print(df_subset, end=hline)
df_subset.loc['a', 'y'] = 99
print(df_subset, df, sep=hline)

   x  y  z
a  0  1  2
b  3  4  5
c  6  7  8
────────────────────────────────────────
   x   y  z
a  0  99  2
b  3   4  5
c  6   7  8
────────────────────────────────────────
   x   y   z
a  0  99   2
b  3   4   5
c  6   7   8
d  9  10  11

Wait… now we elicited the SettingWithCopyWarning and we changed the original DataFrame?

This is exactly the ambiguity that pandas is attempting to warn us of. We may mistakenly mutate a DataFrame by operating on a subset of its data. This entirely has to do with when our subset is a true view of underlying data vs a copy of the dataset. Unfortunately, the rules that govern when pandas makes a copy vs a view are quite opaque and require in-depth knowledge of to even begin making sense. In the example above, we can follow the same view & copy rules numpy uses because the underlying data in our DataFrame can be represented densely as a singly dtype'd numpy.array. However with real world data, these same rules typically cannot be followed because of the complexity that is brought in when dealing with heterogeneously dtyped columns.

Outside of simple cases, it’s very hard to predict whether it [df.__getitem__] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)…

Circumventing SettingWithCopyWarning

Dealing with SettingWithCopyWarning's has had a long history in pandas. The core developers have a proposal to change the behavior of __getitem__ to enforce a more consistent user experience. By returning a new Series or DataFrame from __getitem__ and implementing a copy on write policy (meaning underlying data is not copied until one attempts to overwrite/update it), this proposal hopes to smooth the user experience when using pandas by nearly eliminating the idea that pandas uses views or copies at all.

When implemented, proposal will be completely backwards incompatible with previous pandas code. Thankfully, there are conceptual approaches we can use that eliminate the need to ever encounter the SettingWithCopyWarning.

First up, we have the idea of working in masks instead of with subsets. When we work with masks, we avoid subsetting our data entirely until necessary. This avoids creating intermediate copies/views of data to remove ambiguity from our code.

Working With Masks

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])

# Pre-compute boolean masks of subset interest
x_lt_5 = df['x'] < 5
y_eq_7 = df['y'] == 7

# Perform the subsetting only when we need to operate
df.loc[x_lt_5 | y_eq_7, 'z'] = 99

print(df)

   x   y   z
a  0   1  99
b  3   4  99
c  6   7  99
d  9  10  11

When we work with masks in this fashion, we acknowledge that we’re not interested in updating independent subsets of data, but dealing with the dataset as a whole. Whenever we do need a subset, we simply combine and apply our masks as needed to perform data manipulations.

Defensive .copy

However we often do want to work with our data as separate subsets. In this case, how else can I reduce ambiguity? Well, you can tell pandas pretty explicitly that you want to work with a .copy and not a view.

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])

# Pre-compute boolean masks of subset interest
subset_df = df.loc[df['x'] < 5].copy()

# Perform the subsetting only when we need to operate
subset_df['y'] = 99

print(subset_df, df, sep=hline)

   x   y  z
a  0  99  2
b  3  99  5
────────────────────────────────────────
   x   y   z
a  0   1   2
b  3   4   5
c  6   7   8
d  9  10  11

You can see the above code successfully created a subset, assigned new values to that subset, avoided propagating the change to the broader data set, and did not raise the SettingWithCopyWarning. I guess explicit is better than implicit after all.

Wrap Up

That’s all for this week, thanks for reading through and I hope you can avoid the dreaded SettingWithCopyWarning in your code! Keep an eye on that proposal too, as it might be acted on sooner than you think. Talk to you all next week.