# Pandas: SettingWithCopyWarning

Wrapping up June already?! I can’t believe how quickly things are moving.

I wanted to take some time today to discuss one of the most common issues facing pandas users: **SettingWithCopyWarning**

This warning tells the user that they are assigning values to a `DataFrame` that originated as a subset of another `DataFrame`. While this may sound like a completely fine operation on the surface, the underlying view/copy behaviors in `pandas` lead to sometimes ambiguous behavior, where user code can have unintended side effects on the data.

## Simple Mutation Assignment

```
from pandas import DataFrame
df = DataFrame(0, index=[*'abc'], columns=[*'xyz'])
print(df, end=(hline := '\n{}\n'.format('\N{box drawings light horizontal}' * 40)))
df.loc['b', 'y'] = 1
print(df)
```

```
   x  y  z
a  0  0  0
b  0  0  0
c  0  0  0
────────────────────────────────────────
   x  y  z
a  0  0  0
b  0  1  0
c  0  0  0
```

The above snippet performs an in-place assignment, which simply means that the underlying `DataFrame` is mutated. This does not tell us whether the operation was performed on a view or a copy of the original, as there are many more factors at play when pandas determines whether we operate with views or copies.

Views & copies aside, we can confirm that we updated our data successfully by noting the change in the output above. Let’s take a look at another example.

## Subset Operation Propagation

```
from numpy import arange
from pandas import DataFrame
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
df_subset = df.loc[lambda d: d['x'] < 5]
print(df_subset, end=hline)
df_subset.loc['a', 'y'] = 99
print(df_subset, df, sep=hline)
```

```
   x  y  z
a  0  1  2
b  3  4  5
────────────────────────────────────────
   x   y  z
a  0  99  2
b  3   4  5
────────────────────────────────────────
   x   y   z
a  0   1   2
b  3   4   5
c  6   7   8
d  9  10  11
```

```
/tmp/ipykernel_2288008/1874876058.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_subset.loc['a', 'y'] = 99
```

By performing the same kind of assignment, `df_subset.loc['a', 'y'] = 99`, as I did in the first example, I’m now eliciting the `SettingWithCopyWarning`, and the original `df` did not change. What is going on, and what is `pandas` trying to communicate to me?

As I hinted at before, `pandas` is notifying us of an ambiguity. Here, we are operating on a subset of a larger `DataFrame`, and `pandas` is pointing out that it is unsure whether or not we wanted the change we made to `df_subset` to propagate back to the original `df`.

If we use a `slice` instead of a boolean mask to subset our DataFrame, then we get some very interesting results.

```
from numpy import arange
from pandas import DataFrame
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
df_subset = df.loc['a':'c']
print(df_subset, end=hline)
df_subset.loc['a', 'y'] = 99
print(df_subset, df, sep=hline)
```

```
   x  y  z
a  0  1  2
b  3  4  5
c  6  7  8
────────────────────────────────────────
   x   y  z
a  0  99  2
b  3   4  5
c  6   7  8
────────────────────────────────────────
   x   y   z
a  0  99   2
b  3   4   5
c  6   7   8
d  9  10  11
```

```
/tmp/ipykernel_2288008/2727519732.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_subset.loc['a', 'y'] = 99
```

Wait… now we elicited the `SettingWithCopyWarning` and we changed the original `DataFrame`?

This is exactly the ambiguity that `pandas` is attempting to warn us of. We may mistakenly mutate a `DataFrame` by operating on a subset of its data. This has entirely to do with whether our subset is a true view of the underlying data or a copy of it. Unfortunately, the rules that govern when `pandas` makes a copy vs a view are quite opaque and require in-depth knowledge of its internals to even begin making sense of them. In the example above, we can follow the same view & copy rules that numpy uses (basic slicing returns a view, boolean-mask indexing returns a copy) because the underlying data in our DataFrame can be represented densely as a single-dtype `numpy.ndarray`. However, with real-world data these same rules typically cannot be relied on, because of the complexity that comes with heterogeneously dtyped columns.
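To make the NumPy side of that rule concrete, here is a minimal sketch using plain `numpy` with no pandas involved: basic slicing hands back a view, while boolean-mask indexing hands back a copy. The variable names are mine and purely illustrative.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

sliced = arr[0:3]            # basic slicing: a view onto arr's memory
masked = arr[arr[:, 0] < 5]  # boolean indexing: a fresh copy (rows 0 and 1)

print(np.shares_memory(arr, sliced))  # True
print(np.shares_memory(arr, masked))  # False

sliced[0, 1] = 99   # writes through to arr
masked[1, 1] = -1   # only touches the copy
print(arr[0, 1], arr[1, 1])  # 99 4
```

This mirrors the two pandas examples above: the label slice behaved like `sliced`, and the boolean mask behaved like `masked`.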

> Outside of simple cases, it’s very hard to predict whether it [`df.__getitem__`] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)…
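For a single-dtype frame like the ones in this post, one rough way to see what you got back is to ask NumPy whether the two frames' underlying arrays overlap. Treat this as a diagnostic sketch under that single-dtype assumption, not a dependable test: with mixed dtypes, `to_numpy()` has to build a brand-new array, so the answer stops being meaningful.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(np.arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])

by_mask = df.loc[df['x'] < 5]   # subset via boolean mask
by_slice = df.loc['a':'c']      # subset via label slice

# Assumption: single-dtype frame, so to_numpy() can expose the existing buffer
mask_shares = np.shares_memory(df.to_numpy(), by_mask.to_numpy())
slice_shares = np.shares_memory(df.to_numpy(), by_slice.to_numpy())
print(mask_shares, slice_shares)
```

Given the outputs shown earlier, I would expect this to report that the masked subset does not share memory with `df` while the sliced subset does, which matches which of the two assignments propagated back.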

## Circumventing SettingWithCopyWarning

Dealing with `SettingWithCopyWarning` has a long history in `pandas`. The core developers have a proposal to change the behavior of `__getitem__` to enforce a more consistent user experience. By always returning a new `Series` or `DataFrame` from `__getitem__` and implementing a *copy on write* policy (meaning underlying data is not copied until one attempts to overwrite/update it), this proposal hopes to smooth the user experience when using `pandas` by nearly eliminating the idea that `pandas` uses views or copies at all.
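As a rough illustration of what copy on write would mean for the slice example from earlier, the sketch below assumes a pandas version that ships the opt-in `mode.copy_on_write` option (I believe 1.5 and later; on versions where copy on write is already the default, the option should make no difference). Under that assumption, writing to the subset leaves the parent untouched.

```python
from numpy import arange
import pandas as pd

# Assumption: this pandas version provides the "mode.copy_on_write" option
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
    df_subset = df.loc['a':'c']    # may share data with df until someone writes
    df_subset.loc['a', 'y'] = 99   # the write forces the subset onto its own data

print(df_subset.loc['a', 'y'], df.loc['a', 'y'])
```

Compare this with the slice example above, where the very same two lines changed `df` as well.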

When implemented, this proposal will not be backwards compatible with existing `pandas` code. Thankfully, there are conceptual approaches we can use today that eliminate the need to ever encounter the `SettingWithCopyWarning`.

First up, we have the idea of working in masks instead of with subsets. When we work with masks, we avoid subsetting our data entirely until necessary. This avoids creating intermediate copies/views of the data, which removes the ambiguity from our code.

### Working With Masks

```
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
# Pre-compute boolean masks of subset interest
x_lt_5 = df['x'] < 5
y_eq_7 = df['y'] == 7
# Perform the subsetting only when we need to operate
df.loc[x_lt_5 | y_eq_7, 'z'] = 99
print(df)
```

```
   x   y   z
a  0   1  99
b  3   4  99
c  6   7  99
d  9  10  11
```

When we work with masks in this fashion, we acknowledge that we’re not interested in updating independent subsets of data, but dealing with the dataset as a whole. Whenever we do need a subset, we simply combine and apply our masks as needed to perform data manipulations.
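Here is a slightly longer sketch of the same idea, showing masks being combined with `&` and `~` and reused for both reads and writes. The particular operations (multiplying `z`, overwriting `y`) are arbitrary choices for illustration.

```python
from numpy import arange
from pandas import DataFrame

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])

x_lt_5 = df['x'] < 5
y_eq_7 = df['y'] == 7

# Read from a subset without keeping it around as a separate DataFrame
z_total = df.loc[x_lt_5, 'z'].sum()

# Combine masks with & (and), | (or), ~ (not); every write goes through df.loc
df.loc[x_lt_5 & ~y_eq_7, 'z'] *= 10
df.loc[~x_lt_5, 'y'] = -1

print(z_total)  # 7
print(df)
```

One thing to keep in mind: a mask is a snapshot. `y_eq_7` was computed before `y` was overwritten, so it no longer describes the current data and would need to be recomputed before being used again.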

### Defensive .copy

However, we often do want to work with our data as separate subsets. In this case, how else can we reduce ambiguity? Well, we can tell pandas explicitly that we want to work with a `.copy` and not a view.

```
df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])
# Take the subset and explicitly copy it so that it owns its data
subset_df = df.loc[df['x'] < 5].copy()
# Assign to the copy; the original df is left untouched
subset_df['y'] = 99
print(subset_df, df, sep=hline)
```

```
   x   y  z
a  0  99  2
b  3  99  5
────────────────────────────────────────
   x   y   z
a  0   1   2
b  3   4   5
c  6   7   8
d  9  10  11
```

You can see the above code successfully created a subset, assigned new values to that subset, avoided propagating the change to the broader data set, and did **not** trigger the `SettingWithCopyWarning`. I guess *explicit is better than implicit* after all.
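The same defensive habit is handy inside functions, where you usually have no idea whether the caller handed you a fresh frame or a subset of something bigger. A minimal sketch, with `with_doubled_y` being a hypothetical helper I made up for this post:

```python
import warnings
from numpy import arange
from pandas import DataFrame

def with_doubled_y(frame):
    """Hypothetical helper: returns a modified copy, never touches its input."""
    out = frame.copy()          # explicit copy, so ownership is unambiguous
    out['y'] = out['y'] * 2
    return out

df = DataFrame(arange(12).reshape(4, 3), index=[*'abcd'], columns=[*'xyz'])

# Record any warnings emitted while the helper works on a subset
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = with_doubled_y(df.loc[df['x'] < 5])

swc = [w for w in caught if w.category.__name__ == "SettingWithCopyWarning"]
print(result['y'].tolist(), df['y'].tolist(), len(swc))  # [2, 8] [1, 4, 7, 10] 0
```

The cost is an extra copy of the data on every call, which is usually a fair trade for code whose side effects you can reason about.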

## Wrap Up

That’s all for this week, thanks for reading through and I hope you can avoid the dreaded `SettingWithCopyWarning` in your code! Keep an eye on that proposal too, as it might be acted on sooner than you think. Talk to you all next week.