Reputation: 7335
I have two dataframes and I want to update a column in one based on the other. The problem is that when I update the column, the original dataframe gets overwritten as well.
(One dataframe contains the correlation between each column and the target variable; the other is supposed to hold the ranking.)
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:100]
y = iris.target[:100]
clmns = iris.feature_names
out = pd.DataFrame(index=np.arange(0,len(clmns)), columns=['coef'])
feat_coef = pd.DataFrame(columns=['Feature_name','pearson_koef_FM'])
feat_coef['Feature_name'] = clmns
feat_rank = feat_coef
X_np = np.array(X)
y_np = np.array(y)
for idx, name in enumerate(clmns):
    # Pearson correlation between feature column idx and the target
    out.loc[idx, 'coef'] = pearsonr(X_np[:, idx], y_np)[0]
feat_coef['pearson_koef_FM'] = np.absolute(out['coef'])
print('----BEFORE----')
print(feat_coef)
feat_rank['pearson_koef_FM'] = feat_coef['pearson_koef_FM'].rank(ascending=False)
print('----AFTER----')
print(feat_coef)
Which returns this:
----BEFORE----
        Feature_name  pearson_koef_FM
0  sepal length (cm)          0.72829
1   sepal width (cm)         0.684019
2  petal length (cm)         0.969955
3   petal width (cm)         0.960158
----AFTER----
        Feature_name  pearson_koef_FM
0  sepal length (cm)              3.0
1   sepal width (cm)              4.0
2  petal length (cm)              1.0
3   petal width (cm)              2.0
Obviously, I expect feat_coef to remain unchanged. If I print feat_rank, I get the correct output. I suspect it has something to do with copies vs. views when copying dataframes.
Upvotes: 0
Views: 99
Reputation: 210832
After this line:
feat_rank = feat_coef
feat_rank is not a copy; it is a second name for the same DataFrame object as feat_coef:
In [9]: feat_rank is feat_coef
Out[9]: True
In [10]: id(feat_rank)
Out[10]: 177476664
In [11]: id(feat_coef)
Out[11]: 177476664
In [12]: id(feat_coef) == id(feat_rank)
Out[12]: True
In [13]: feat_rank['new'] = 100
In [14]: feat_coef
Out[14]:
        Feature_name  pearson_koef_FM  new
0  sepal length (cm)          0.72829  100
1   sepal width (cm)         0.684019  100
2  petal length (cm)         0.969955  100
3   petal width (cm)         0.960158  100
So any change you make through feat_rank (adding a column or overwriting an existing one) also shows up in feat_coef, because both names refer to the same DataFrame.
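A minimal, self-contained sketch of that aliasing, using a hypothetical two-row frame in place of the iris data (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for feat_coef; the numbers are illustrative only.
feat_coef = pd.DataFrame({'Feature_name': ['a', 'b'],
                          'pearson_koef_FM': [0.7, 0.9]})

# Plain assignment binds a second name to the same object; nothing is copied.
feat_rank = feat_coef

# Overwriting the column through one name...
feat_rank['pearson_koef_FM'] = feat_rank['pearson_koef_FM'].rank(ascending=False)

# ...is visible through the other name as well.
same_object = feat_rank is feat_coef
values_via_feat_coef = feat_coef['pearson_koef_FM'].tolist()
print(same_object)
print(values_via_feat_coef)
```

This is ordinary Python name binding rather than anything specific to pandas: the same thing happens with lists, dicts, and any other mutable object.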
Solution: if you need an independent DF, use .copy():
feat_rank = feat_coef.copy()
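As a quick check, here is a small sketch of the same ranking step with .copy() applied. The three-row frame is a hypothetical stand-in for feat_coef, not the iris result:

```python
import pandas as pd

# Hypothetical stand-in for feat_coef; the numbers are illustrative only.
feat_coef = pd.DataFrame({'Feature_name': ['a', 'b', 'c'],
                          'pearson_koef_FM': [0.73, 0.68, 0.97]})

# .copy() returns a new, independent DataFrame (deep=True is the default).
feat_rank = feat_coef.copy()

# Overwrite the column only in the copy.
feat_rank['pearson_koef_FM'] = feat_coef['pearson_koef_FM'].rank(ascending=False)

coef_values = feat_coef['pearson_koef_FM'].tolist()
rank_values = feat_rank['pearson_koef_FM'].tolist()
print(coef_values)   # original correlations, untouched
print(rank_values)   # ranks, stored only in feat_rank
```

With the copy in place, feat_coef keeps its coefficients and only feat_rank holds the ranks.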
Upvotes: 1