HonzaB

Reputation: 7335

Pandas: setting a column in a new dataframe replaces it in the old dataframe

I have two dataframes and I want to update a column in one based on the other. The problem is that when I update the column, the old dataframe gets overwritten as well.

(One dataframe contains the correlation between each column and the target variable; the other is supposed to show the ranking.)

import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:100]
y = iris.target[:100]
clmns = iris.feature_names

out = pd.DataFrame(index=np.arange(0,len(clmns)), columns=['coef'])

feat_coef = pd.DataFrame(columns=['Feature_name','pearson_koef_FM']) 

feat_coef['Feature_name'] = clmns
feat_rank = feat_coef

X_np = np.array(X)
y_np = np.array(y)
for idx, name in enumerate(clmns):
    # single .loc call instead of chained indexing, so the write always reaches `out`
    out.loc[idx, 'coef'] = pearsonr(X_np[:, idx], y_np)[0]

feat_coef['pearson_koef_FM'] = np.absolute(out['coef'])

print('----BEFORE----')
print(feat_coef)

feat_rank['pearson_koef_FM'] = feat_coef['pearson_koef_FM'].rank(ascending=False)

print('----AFTER----')
print(feat_coef)

Which returns this:

----BEFORE----
        Feature_name pearson_koef_FM
0  sepal length (cm)         0.72829
1   sepal width (cm)        0.684019
2  petal length (cm)        0.969955
3   petal width (cm)        0.960158
----AFTER----
        Feature_name  pearson_koef_FM
0  sepal length (cm)              3.0
1   sepal width (cm)              4.0
2  petal length (cm)              1.0
3   petal width (cm)              2.0

I expected feat_coef to remain unchanged. If I print feat_rank, I get the correct output. I suspect it has something to do with copy vs. view semantics when copying dataframes.

Upvotes: 0

Views: 99

Answers (1)

MaxU - stand with Ukraine

Reputation: 210832

After this line:

feat_rank = feat_coef

feat_rank is not a copy. It is a second name for the same DataFrame object as feat_coef:

In [9]: feat_rank is feat_coef
Out[9]: True

In [10]: id(feat_rank)
Out[10]: 177476664

In [11]: id(feat_coef)
Out[11]: 177476664

In [12]: id(feat_coef) == id(feat_rank)
Out[12]: True

In [13]: feat_rank['new'] = 100

In [14]: feat_coef
Out[14]:
        Feature_name pearson_koef_FM  new
0  sepal length (cm)         0.72829  100
1   sepal width (cm)        0.684019  100
2  petal length (cm)        0.969955  100
3   petal width (cm)        0.960158  100

So any change you make through feat_rank, whether you add a new column or overwrite values in an existing one, also shows up in feat_coef, because both names refer to the same object.

Solution: if you need an independent DataFrame, use .copy():

feat_rank = feat_coef.copy()
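
To make the difference concrete, here is a minimal sketch that puts plain assignment and .copy() side by side. The small feat_coef below is a hypothetical stand-in with rounded values, not the exact frame from the question, and alias is a name I made up for the illustration:

```python
import pandas as pd

# Hypothetical stand-in for the question's feat_coef (values rounded).
feat_coef = pd.DataFrame({
    "Feature_name": ["sepal length (cm)", "sepal width (cm)",
                     "petal length (cm)", "petal width (cm)"],
    "pearson_koef_FM": [0.728, 0.684, 0.970, 0.960],
})

alias = feat_coef              # plain assignment: a second name for the same object
feat_rank = feat_coef.copy()   # .copy() (deep by default): an independent DataFrame

# Overwrite the column in the copy only.
feat_rank["pearson_koef_FM"] = feat_coef["pearson_koef_FM"].rank(ascending=False)

print(alias is feat_coef)                      # True
print(feat_rank is feat_coef)                  # False
print(feat_coef["pearson_koef_FM"].tolist())   # [0.728, 0.684, 0.97, 0.96]
print(feat_rank["pearson_koef_FM"].tolist())   # [3.0, 4.0, 1.0, 2.0]
```

With the copy in place, feat_coef keeps the correlation values and only feat_rank holds the ranks.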

Upvotes: 1
