Pandas: SettingWithCopyWarning#
Wrapping up June already?! I can’t believe how quickly things are moving.
I wanted to take some time today to discuss one of the most common issues facing pandas users: SettingWithCopyWarning.
This warning tells the user that they are assigning values to a DataFrame that originated as a subset of another DataFrame. While this may sound like a completely fine operation on the surface, the underlying view/copy behavior in pandas makes the result ambiguous: user code can have unintended side effects on the data.
Simple Mutation Assignment#
from pandas import DataFrame
df = DataFrame(0, index=[*'abc'], columns=[*'xyz'])
print(df, end=(hline := '\n{}\n'.format('\N{box drawings light horizontal}' * 40)))
df.loc['b', 'y'] = 1
print(df)
   x  y  z
a  0  0  0
b  0  0  0
c  0  0  0
────────────────────────────────────────
   x  y  z
a  0  0  0
b  0  1  0
c  0  0  0
The above snippet performs in-place assignment, which simply means that the underlying DataFrame is mutated. This does not tell us whether the operation was performed on a view or a copy of the original, as there are many more factors at play when pandas determines whether we operate with views or copies.
Views & copies aside, we can confirm that we updated our data successfully by noting the change in the above output. Let’s take a look at another example.
Subset Operation Propagation#
from numpy import arange
from pandas import DataFrame
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
df_subset = df.loc[lambda d: d['x'] < 5]
print(df_subset, end=hline)
df_subset.loc['a', 'y'] = 99
print(df_subset, df, sep=hline)
   x  y  z
a  0  1  2
b  3  4  5
────────────────────────────────────────
   x   y  z
a  0  99  2
b  3   4  5
────────────────────────────────────────
   x   y   z
a  0   1   2
b  3   4   5
c  6   7   8
d  9  10  11
/tmp/ipykernel_111389/1874876058.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_subset.loc['a', 'y'] = 99
By performing the same kind of assignment operation, df_subset.loc['a', 'y'] = 99, as I did in the first example, I’m now eliciting the SettingWithCopyWarning. What is going on, and what is pandas trying to communicate to me?
As I hinted at before, pandas is notifying us of an ambiguity. Here, we are operating on a subset of a larger DataFrame, and pandas is pointing out that it is unsure whether or not we wanted the change we made to df_subset to propagate back to the original df. In this case it did not: the output above shows df unchanged.
If we use a slice instead of a boolean mask to subset our DataFrame, then we can get some very interesting results.
from numpy import arange
from pandas import DataFrame
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
df_subset = df.loc['a':'c']
print(df_subset, end=hline)
df_subset.loc['a', 'y'] = 99
print(df_subset, df, sep=hline)
   x  y  z
a  0  1  2
b  3  4  5
c  6  7  8
────────────────────────────────────────
   x   y  z
a  0  99  2
b  3   4  5
c  6   7  8
────────────────────────────────────────
   x   y   z
a  0  99   2
b  3   4   5
c  6   7   8
d  9  10  11
/tmp/ipykernel_111389/2727519732.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_subset.loc['a', 'y'] = 99
Wait… now we elicited the SettingWithCopyWarning and we changed the original DataFrame?
This is exactly the ambiguity that pandas is attempting to warn us of. We may mistakenly mutate a DataFrame by operating on a subset of its data. This has entirely to do with whether our subset is a true view of the underlying data or a copy of it. Unfortunately, the rules that govern when pandas makes a copy vs a view are quite opaque and require in-depth knowledge of pandas internals to even begin making sense of them. In the example above, we can follow the same view & copy rules NumPy uses because the underlying data in our DataFrame can be represented densely as a single-dtype NumPy array. However, with real-world data, these same rules typically cannot be followed because of the complexity that comes with heterogeneously dtyped columns.
Outside of simple cases, it’s very hard to predict whether it [df.__getitem__] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)…
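To make the view vs copy distinction a little more concrete, here is a minimal sketch that asks NumPy whether each kind of subset shares a buffer with the original. It leans on numpy.shares_memory and DataFrame.to_numpy, and it assumes the frame holds a single dtype (as in the examples above) so that to_numpy does not have to build a fresh array. The names mask_subset and slice_subset are my own, and the behavior may differ between pandas versions, so treat the result as an observation rather than a guarantee.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(np.arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])

# Boolean-mask selection: the chosen rows are gathered into a new array.
mask_subset = df.loc[df['x'] < 5]
# Label-slice selection: a contiguous run of rows can reuse the parent's buffer.
slice_subset = df.loc['a':'c']

# Assumption: every column shares one dtype, so to_numpy() exposes the
# underlying buffer instead of assembling a new one.
mask_shares = np.shares_memory(mask_subset.to_numpy(), df.to_numpy())
slice_shares = np.shares_memory(slice_subset.to_numpy(), df.to_numpy())

print(f'mask subset shares memory with df:  {mask_shares}')
print(f'slice subset shares memory with df: {slice_shares}')
```

On the setups I have in mind, the mask-based subset reports no shared memory while the slice-based subset does, which lines up with the two outputs shown earlier. With mixed dtypes this check stops being informative, because to_numpy has to build a new array either way.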
Circumventing SettingWithCopyWarning#
Dealing with SettingWithCopyWarning has a long history in pandas. The core developers have a proposal to change the behavior of __getitem__ to enforce a more consistent user experience. By returning a new Series or DataFrame from __getitem__ and implementing a copy-on-write policy (meaning underlying data is not copied until one attempts to overwrite/update it), this proposal hopes to smooth the user experience when using pandas by nearly eliminating the idea that pandas uses views or copies at all.
When implemented, this proposal will be backwards incompatible with existing pandas code. Thankfully, there are conceptual approaches we can use that eliminate the need to ever encounter the SettingWithCopyWarning.
First up, we have the idea of working in masks instead of with subsets. When we work with masks, we avoid subsetting our data entirely until necessary. Because no intermediate copies/views of the data are created, the ambiguity disappears from our code.
Working With Masks#
from numpy import arange
from pandas import DataFrame
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
# Pre-compute the boolean masks we are interested in
x_lt_5 = df['x'] < 5
y_eq_7 = df['y'] == 7
# Combine and apply the masks only at the point of assignment
df.loc[x_lt_5 | y_eq_7, 'z'] = 99
print(df)
   x   y   z
a  0   1  99
b  3   4  99
c  6   7  99
d  9  10  11
When we work with masks in this fashion, we acknowledge that we’re not interested in updating independent subsets of data, but dealing with the dataset as a whole. Whenever we do need a subset, we simply combine and apply our masks as needed to perform data manipulations.
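For contrast, here is a small sketch of what tends to go wrong when the mask and the column are applied in two separate indexing steps rather than in one .loc call. This is my own illustration, not something from the cells above: the variable names z_after_chained and z_after_loc are hypothetical, and the warning is silenced only so the demonstration stays quiet.

```python
import warnings
from numpy import arange
from pandas import DataFrame

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
x_lt_5 = df['x'] < 5

# Two-step assignment: df[x_lt_5] builds an intermediate object first,
# and the assignment lands on that intermediate rather than on df.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the warning for this demo only
    df[x_lt_5]['z'] = 99
z_after_chained = df['z'].tolist()

# One-step assignment: mask and column go into a single .loc call on df.
df.loc[x_lt_5, 'z'] = 99
z_after_loc = df['z'].tolist()

print(z_after_chained)
print(z_after_loc)
```

The first print should still show the original z values, while the second shows 99 in rows a and b. Keeping the row mask and the column label inside the same .loc call is what makes the intent unambiguous.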
Defensive .copy#
However, we often do want to work with our data as separate subsets. In this case, how else can I reduce ambiguity? Well, you can tell pandas pretty explicitly that you want to work with a .copy and not a view.
from numpy import arange
from pandas import DataFrame
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
# Take the subset and explicitly copy it so it owns its own data
subset_df = df.loc[df['x'] < 5].copy()
# Assign on the copy; the original df is left untouched
subset_df['y'] = 99
print(subset_df, df, sep=hline)
   x   y  z
a  0  99  2
b  3  99  5
────────────────────────────────────────
   x   y   z
a  0   1   2
b  3   4   5
c  6   7   8
d  9  10  11
You can see the above code successfully created a subset, assigned new values to that subset, avoided propagating the change to the broader data set, and did not raise the SettingWithCopyWarning. I guess explicit is better than implicit after all.
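If you would rather check this programmatically than eyeball the output, the sketch below wraps the defensive copy in a small helper and records any warnings raised while mutating the result. The helper name top_rows and its limit parameter are hypothetical, and matching the warning by class name is simply a convenience I chose so the snippet does not depend on where pandas exposes the warning class.

```python
import warnings
from numpy import arange
from pandas import DataFrame


def top_rows(frame, limit):
    """Hypothetical helper: return an independent copy of rows where x < limit."""
    return frame.loc[frame['x'] < limit].copy()


df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
original = df.copy()  # snapshot used to confirm df is left untouched

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # record every warning raised in this block
    subset_df = top_rows(df, 5)
    subset_df.loc['a', 'y'] = 99

# Keep only warnings whose class name mentions SettingWithCopy.
copy_warnings = [w for w in caught if 'SettingWithCopy' in w.category.__name__]

print(f'SettingWithCopy warnings: {len(copy_warnings)}')
print(f'df unchanged: {df.equals(original)}')
```

Returning the copy from the helper keeps the decision in one place, so callers can mutate the result freely without worrying about the frame they passed in.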
Wrap Up#
That’s all for this week, thanks for reading through and I hope you can avoid the dreaded SettingWithCopyWarning in your code! Keep an eye on that proposal too, as it might be acted on sooner than you think. Talk to you all next week.