On copies and views: getting rid of the SettingWithCopyWarning
Pandas' current behavior on whether indexing returns a view or a copy is confusing, even for experienced users. But it doesn’t have to be this way. We can make this aspect of pandas easier to grasp by simplifying the copy/view rules, make pandas more memory-efficient at the same time, and get rid of the SettingWithCopyWarning.
Context
The infamous "SettingWithCopyWarning" has probably confused / annoyed / enraged (delete what does not apply to you) many users of pandas. This is also evident from the many lengthy blog posts on this topic that go into the details of what it is and how to deal with it.
As a quick recap, let's consider the following example where we filter a DataFrame df to create subset, and then modify subset:
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
>>> subset = df[df["A"] > 1]
>>> subset["C"] = 10
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Does this last line only change subset, or also df?
In general, it is not always clear when a new object (subset in the example above) is a view on the original data (df) or a copy. With a view, we mean the new object isn't a copy of the original data but is "viewing" the same data in memory. That means that when you edit this view, you are actually updating the original data as well.
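The difference between a view and a copy is easiest to see one level down, in NumPy, where the rules are simple. The snippet below is only a sketch of the concept (the variable names are mine, and it says nothing about when pandas itself returns a view): basic slicing of an array gives a view, while .copy() gives independent data.

```python
import numpy as np

original = np.array([1, 2, 3, 4])

view = original[:2]          # basic slicing: a view on the same memory
copy = original[:2].copy()   # explicit copy: independent memory

view[0] = 100   # writes through to `original`
copy[1] = -1    # only changes `copy`

shares_with_view = np.shares_memory(original, view)   # True
shares_with_copy = np.shares_memory(original, copy)   # False
result = original.tolist()                            # [100, 2, 3, 4]
```

Editing the view changed original, while editing the copy did not.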
Even expert pandas developers (let alone new users) probably cannot correctly predict in all cases whether a certain operation will result in a view or a copy. Because this is not always clear, pandas warns users about potentially unexpected behaviour with the SettingWithCopyWarning.
A proposal for a simpler behaviour
Therefore, here is a new proposal to simplify this situation and move towards a single rule: any DataFrame or Series derived from another in any way (e.g. with an indexing operation) always behaves as a copy.
A single rule, but one that we can phrase in different ways. For example, an implication of this is: mutating a DataFrame only changes the object itself, and not any other. Or, put differently: if you want to change values in a DataFrame or Series, you can only do that by directly mutating the DataFrame/Series at hand.
So with this in mind, we can go back to the original example:
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
>>> subset = df[df["A"] > 1]
>>> subset["C"] = 10
and try to answer the question again: will mutating subset also modify df? Following this proposed new rule, we can say: since subset is not the same object as df, mutating subset will not change df.
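To make both readings of the rule concrete, here is a small sketch that behaves the same on any pandas version. The explicit .copy() is my addition: it emulates the proposed "always behaves as a copy" rule (and would be unnecessary if the proposal were adopted). The second half shows the other side of the rule: to change df, mutate df directly, here with .loc.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Explicit copy: emulates the proposed rule on any pandas version
# (an assumption of this sketch, not something the proposal requires).
subset = df[df["A"] > 1].copy()
subset["C"] = 10

df_c_after_subset_edit = df["C"].tolist()   # [5, 6]: df is untouched
subset_c = subset["C"].tolist()             # [10]

# To change df, mutate df itself.
df.loc[df["A"] > 1, "C"] = 10
df_c_after_direct_edit = df["C"].tolist()   # [5, 10]
```

Mutating subset leaves df alone, and the only way to update df is through df.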
Several advantages
This proposal has several advantages:
- A simpler, more consistent user experience
- We can get rid of the SettingWithCopyWarning (since there is no confusion about whether we are mutating a view or a copy)
- We would no longer need defensive copying in many places in pandas, improving memory usage (using "Copy-on-Write")
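To give a rough idea of how "always behaves as a copy" and "fewer copies" can go together, here is a toy Copy-on-Write container in plain Python. Everything in it (CowList, _Buffer, derive, shares_memory_with) is hypothetical and made up for illustration; it is not how pandas implements this. Derived objects share the underlying data, and the actual copy is deferred until one of them is written to.

```python
class _Buffer:
    """Holds the data plus a count of how many objects share it."""

    def __init__(self, values):
        self.values = list(values)
        self.refs = 1


class CowList:
    """Toy copy-on-write list (hypothetical, for illustration only)."""

    def __init__(self, values):
        self._buf = _Buffer(values)

    def derive(self):
        # No data is copied here: the new object shares the buffer.
        new = CowList.__new__(CowList)
        new._buf = self._buf
        self._buf.refs += 1
        return new

    def __getitem__(self, index):
        return self._buf.values[index]

    def __setitem__(self, index, value):
        if self._buf.refs > 1:
            # Someone else shares this data: copy it before writing.
            self._buf.refs -= 1
            self._buf = _Buffer(self._buf.values)
        self._buf.values[index] = value

    def shares_memory_with(self, other):
        return self._buf is other._buf


parent = CowList([5, 6])
child = parent.derive()
shared_before_write = child.shares_memory_with(parent)   # True

child[1] = 10   # triggers the deferred copy
shared_after_write = child.shares_memory_with(parent)    # False
parent_value = parent[1]                                 # 6
child_value = child[1]                                   # 10
```

As long as nobody writes, no copy is ever made, yet a write to child can never leak into parent.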
I will go into more detail on those different aspects in follow-up blog posts.
Important! This is only a proposal, and not yet reality in pandas. Does this sound interesting? Then you can read the full proposal here, and feedback is very welcome!