BeMyGuestPlease

Reputation: 561

Python getting SettingWithCopyWarning - iloc vs. loc - cannot figure out why

I have a basic understanding of SettingWithCopyWarning, but I cannot figure out why I am getting the warning in this particular case.

I am following the code from https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

When I run the code below (using .loc), I do not get the SettingWithCopyWarning.

However, if I run the same code with .iloc instead, I do get the warning.

Can someone help me understand it?

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Upvotes: 2

Views: 691

Answers (2)

GZ0

Reputation: 4263

I did some exploration, and as far as I understand this is what happens under the hood of SettingWithCopyWarning: every time a data frame df is created from another frame df_orig, pandas uses some heuristics to determine whether the data may have been implicitly copied from df_orig, something a less experienced user may not be aware of. If so, the _is_copy field of df is set to a weak reference to df_orig. Later, when an in-place update of df is attempted, pandas decides whether to show a SettingWithCopyWarning based on df._is_copy as well as some other fields of df (note that df._is_copy is not the sole criterion here). However, since some methods are shared among different scenarios, the heuristics are not perfect and some cases can be mishandled.

In the code from the post, both housing.loc[train_index] and housing.iloc[train_index] return an implicit copy of the housing data frame.

for df in (housing.loc[train_index], housing.iloc[train_index]):
    print(df._is_view, df._is_copy)

The above check yields the following result:

False None
False <weakref at 0x0000019BFDF37958; to 'DataFrame' at 0x0000019BFDF26550>

Here, _is_view is another field that shows whether an update on df could affect the original data frame housing. A False value indicates that the underlying data has already been copied. However, for housing.loc[train_index] the df._is_copy field is not set, although in my opinion it should be in this case. As a result, the SettingWithCopyWarning is missing later on, when df is modified in place by the statement df.drop("income_cat", axis=1, inplace=True).
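As a minimal sketch of the behaviour described above, the snippet below uses a small made-up frame (the column values and train_index are assumptions, not the real housing data) and records any warnings raised while dropping a column in place from a .loc slice and from an .iloc slice. Which warnings appear, if any, depends on the installed pandas version, so the sketch only collects them instead of expecting a particular result. What it does show is that the original frame keeps its column either way.

```python
import warnings

import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1],
    "income_cat": [1, 2, 3, 4],
})
train_index = [0, 2, 3]  # hypothetical row positions/labels

results = {}
for name, subset in (("loc", housing.loc[train_index]),
                     ("iloc", housing.iloc[train_index])):
    # Record warnings instead of printing them, so they can be inspected
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        subset.drop("income_cat", axis=1, inplace=True)
    results[name] = (list(subset.columns),
                     [w.category.__name__ for w in caught])

for name, (columns, warning_names) in results.items():
    print(name, columns, warning_names)

# The original frame is unaffected by either in-place drop
print(list(housing.columns))
```

On a pandas version that matches the question, the iloc entry would be expected to list SettingWithCopyWarning while the loc entry stays empty, but that is an assumption based on the post, not something the sketch guarantees.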

In order to avoid SettingWithCopyWarning, you need to either (1) perform the in-place update before slicing; (2) if possible, build the update logic into slicing; or (3) make an "explicit" copy of the data after slicing when an in-place update is needed. In your example, approach (1) looks like this:

# Updates the housing data frame in-place before slicing
income_cat = housing["income_cat"]
housing.drop("income_cat", axis=1, inplace=True)

for train_index, test_index in split.split(housing, income_cat):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Approach (2) looks like this:

feature_cols = housing.columns.difference(["income_cat"])
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # Filter columns at the same time as slicing the rows
    strat_train_set = housing.loc[train_index, feature_cols]
    strat_test_set = housing.loc[test_index, feature_cols]
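One detail worth knowing about approach (2): Index.difference returns the remaining labels sorted by default, so the column order of the slices can differ from the original frame. The sketch below, on a made-up frame with a hypothetical train_index, compares it with Index.drop, which keeps the original order. Either one can be passed to .loc as the column selector.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1],
    "households": [120, 340, 560, 780],
    "income_cat": [1, 2, 3, 4],
})
train_index = [0, 2, 3]  # hypothetical row labels standing in for the split output

# difference() sorts the remaining labels by default
sorted_cols = housing.columns.difference(["income_cat"])
# drop() removes the label and keeps the original column order
ordered_cols = housing.columns.drop("income_cat")

# Filter columns at the same time as slicing the rows
strat_train_set = housing.loc[train_index, ordered_cols]

print(list(sorted_cols))
print(list(ordered_cols))
print(list(strat_train_set.columns))
```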

Approach (3) looks like this:

for train_index, test_index in split.split(housing, housing["income_cat"]):
    ...

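# Without inplace=True, drop() returns a new data frame (a copy),
# so the result has to be assigned back to each variable
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)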

Besides removing inplace=True from the update method, df.copy() is another way to make an "explicit" copy. If you intend to change one or more columns of df, use df.assign(col=...) to create a copy instead of df["col"] = ... (which modifies df in place).
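A minimal sketch of the explicit-copy route, again on a made-up frame with a hypothetical train_index: calling .copy() right after slicing makes the intent clear, so the later in-place drop works on an independent frame, and .assign returns a new frame instead of changing the existing one.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1],
    "income_cat": [1, 2, 3, 4],
})
train_index = [0, 2, 3]  # hypothetical row positions

# Explicit copy: strat_train_set is independent of housing
strat_train_set = housing.iloc[train_index].copy()
strat_train_set.drop("income_cat", axis=1, inplace=True)

# assign() returns a new frame and leaves strat_train_set unchanged
doubled = strat_train_set.assign(
    median_income=strat_train_set["median_income"] * 2
)

print(list(housing.columns))
print(strat_train_set["median_income"].tolist())
print(doubled["median_income"].tolist())
```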

Upvotes: 1

Danny

Reputation: 472

The issue here is not the indexing; .iloc and .loc should behave the same way for you here. The problem is in set_.drop("income_cat", axis=1, inplace=True). It looks like each set_ data frame (strat_train_set and strat_test_set) carries a weak reference to the data frame it was sliced from.

for set_ in (strat_train_set, strat_test_set):
    print(set_._is_copy)

With this you get:

<weakref at 0x128b30598; to 'DataFrame' at 0x128b355c0>
<weakref at 0x128b30598; to 'DataFrame' at 0x128b355c0>

This can lead to SettingWithCopyWarning, because the in-place drop modifies a data frame that pandas has flagged as a possible copy, and it cannot tell whether you expect the change to apply to the original data frame as well.

Upvotes: 5
