Reputation: 561
I have a basic understanding of SettingWithCopyWarning, but I cannot figure out why I am getting the warning in this particular case.
I am following the code from https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
When I run the code below (using .loc), I do not get the SettingWithCopyWarning.
However, if I run the same code with .iloc instead, I do get the warning.
Can someone help me understand why?
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
Upvotes: 2
Views: 691
Reputation: 4263
I did some exploration, and as far as I understand, this is what happens under the hood of SettingWithCopyWarning: every time a data frame df is created from another frame df_orig, pandas uses some heuristics to determine whether the data may have been implicitly copied from df_orig, which a less experienced user may not be aware of. If so, the _is_copy field of df is set to a weak reference to df_orig. Later, when an in-place update of df is attempted, pandas decides whether to show a SettingWithCopyWarning based on df._is_copy as well as some other fields of df (note that df._is_copy is not the sole criterion here). However, since some methods are shared among different scenarios, the heuristics are not perfect and some cases can be mishandled.
In the code from the post, both housing.loc[train_index] and housing.iloc[train_index] return an implicit copy of the housing data frame.
for df in (housing.loc[train_index], housing.iloc[train_index]):
    print(df._is_view, df._is_copy)
The above check yields the following result:
False None
False <weakref at 0x0000019BFDF37958; to 'DataFrame' at 0x0000019BFDF26550>
Here, _is_view is another field that shows whether an update on df could affect the original data frame housing. A False outcome indicates that the underlying data has already been copied. However, for housing.loc[train_index] the df._is_copy field is not set, although in my opinion it should be in this case, so no SettingWithCopyWarning appears later when df is modified in place by the statement df.drop("income_cat", axis=1, inplace=True).
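To see which behaviour your own pandas version shows, something like the following sketch can help. It is only an illustration: the small housing frame and the positions list are made-up stand-ins for the real data and for train_index, and _is_copy is a private attribute that may not exist in every pandas version, which is why it is read with getattr. I am deliberately not showing an expected output, since the result depends on the installed version.

```python
import warnings

import pandas as pd

# Hypothetical stand-in for the housing data; the values are made up.
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1],
    "income_cat": [1, 2, 3, 4],
})
positions = [0, 2]  # stand-in for train_index

caught_by_indexer = {}
for name, subset in (("loc", housing.loc[positions]),
                     ("iloc", housing.iloc[positions])):
    # _is_copy is private and may be missing, so fall back to None.
    is_copy = getattr(subset, "_is_copy", None)
    # Record every warning raised by the in-place drop on this subset.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        subset.drop("income_cat", axis=1, inplace=True)
    caught_by_indexer[name] = [w.category.__name__ for w in caught]
    print(name, is_copy, caught_by_indexer[name])
```

Either way, housing itself keeps its income_cat column, because both subsets hold copied data.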
In order to avoid SettingWithCopyWarning, you need to either (1) perform the in-place update before slicing; (2) if possible, build the update logic into slicing; or (3) make an "explicit" copy of the data after slicing when an in-place update is needed. In your example, approach (1) looks like this:
# Updates the housing data frame in-place before slicing
income_cat = housing["income_cat"]
housing.drop("income_cat", axis=1, inplace=True)
for train_index, test_index in split.split(housing, income_cat):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Approach (2) looks like this:
feature_cols = housing.columns.difference(["income_cat"])
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # Filter columns at the same time as slicing the rows
    strat_train_set = housing.loc[train_index, feature_cols]
    strat_test_set = housing.loc[test_index, feature_cols]
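If you want to try the combined row and column selection without scikit-learn, here is a minimal self-contained sketch. The tiny frame and the two index arrays are hypothetical; they only stand in for the real housing data and for what the splitter would return.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; the values are made up.
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1, 2.7],
    "income_cat": [1, 2, 3, 4, 2],
})
# Hypothetical indices standing in for the splitter's output.
train_index = np.array([0, 1, 3])
test_index = np.array([2, 4])

# Select rows and feature columns in a single .loc call, so there is
# nothing left to drop in place afterwards.
feature_cols = housing.columns.difference(["income_cat"])
strat_train_set = housing.loc[train_index, feature_cols]
strat_test_set = housing.loc[test_index, feature_cols]
```

Two things to keep in mind: Index.difference returns the remaining column labels sorted by default, so the column order can differ from the original frame, and .loc only works with these indices because the frame has the default 0..n-1 index, where labels and positions coincide.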
Approach (3) looks like this:
for train_index, test_index in split.split(housing, housing["income_cat"]):
    ...
# Without "inplace=True", drop returns a new copy, so assign the result back
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
Besides changing the inplace setting of the update method, df.copy() is another method that can be used to make an "explicit" copy. If you intend to change one or more columns of df, use df.assign(col=...) to create a copy rather than df["col"] = ... .
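As a rough illustration of df.copy() and df.assign(), here is a small sketch. The frame and the capping rule are made up for the example and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; the values are made up.
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1],
    "income_cat": [1, 2, 3, 4],
})

# Explicit copy after slicing: the in-place drop touches only the copy.
subset = housing.loc[[0, 2]].copy()
subset.drop("income_cat", axis=1, inplace=True)

# assign() returns a new frame and leaves the original untouched.
capped = housing.assign(income_cat=housing["income_cat"].clip(upper=3))
```

After this runs, housing still has both columns with its original values, subset has only median_income, and capped is a separate frame whose income_cat is limited to 3.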
Upvotes: 1
Reputation: 472
The issue here is not caused by indexing; iloc and loc would work the same way for you here. The problem is in set_.drop("income_cat", axis=1, inplace=True). It looks like there's a weak reference between the set_ data frame and strat_train_set and strat_test_set.
for set_ in (strat_train_set, strat_test_set):
    print(set_._is_copy)
With this you get:
<weakref at 0x128b30598; to 'DataFrame' at 0x128b355c0>
<weakref at 0x128b30598; to 'DataFrame' at 0x128b355c0>
This could lead to SettingWithCopyWarning, as it's trying to transform a copy of the data frame and apply those changes to the original as well.
Upvotes: 5