Reputation: 693
I'm writing a class that does one-hot encoding, but it doesn't work as I expected.
In my main code I have this:
for col in train_x_categorical.columns:
    dataCleaner.addFeatureToBeOneHotEncoded(col)
dataCleaner.applyOneHotEncoding(train_x_categorical)
train_x_categorical.head()
The class methods are the following:
def addFeatureToBeOneHotEncoded(self, featureName):
    self._featuresToBeOneHotEncoded.append(featureName)

def applyOneHotEncoding(self, data):
    for feature in self._featuresToBeOneHotEncoded:
        dummies = pd.get_dummies(data[feature])
        dummies.drop(dummies.columns[-1],axis=1,inplace=True)
        data.drop(feature, axis=1, inplace=True)
        data = pd.concat([data, dummies], axis=1)
    print(data.columns)
Now, with print(data.columns) I can see that the method works correctly, but when train_x_categorical.head() runs I can't see the effect of the method applyOneHotEncoding.
I don't understand why this is happening and how to fix it.
I thought that since Python passes values by reference, the variable data points to the same object as the variable train_x_categorical, so in the method applyOneHotEncoding I was working on the same object, but clearly I am wrong.
Can someone explain to me why my reasoning is wrong and how I can solve the problem?
Upvotes: 0
Views: 140
Reputation: 3720
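It is because the line data = pd.concat([data, dummies], axis=1) inside applyOneHotEncoding rebinds the local name data to a brand-new DataFrame. It does not change the object that train_x_categorical refers to. Python passes object references by value: mutating the object through the parameter is visible to the caller, but assigning a new object to the parameter name is not. This is well-known Python behavior. There are a couple of ways around it that I know of. One is to have your method return the new value and assign it back in the caller, e.g. train_x_categorical = dataCleaner.applyOneHotEncoding(train_x_categorical); this works even though the method loops internally, as long as the return comes after the loop. The other option is to put the variable to be updated in a wrapper class and pass that to the method. Then updating the attribute on the wrapper object is visible to the caller.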
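To make the difference between rebinding and mutating concrete, here is a minimal sketch using plain lists instead of DataFrames. The function names (rebind, mutate, rebuild_and_return) are made up for illustration and do not come from the question.

```python
def rebind(items):
    # Assignment gives the local name a new object; the caller's list is untouched.
    items = items + ["added by rebind"]


def mutate(items):
    # In-place modification changes the single object both names refer to.
    items.append("added by mutate")


def rebuild_and_return(items):
    # Build a new object and hand it back; the caller reassigns its own name.
    return items + ["added by return"]


original = ["a"]

rebind(original)
after_rebind = list(original)    # snapshot: the rebind had no visible effect

mutate(original)
after_mutate = list(original)    # snapshot: the append is visible to the caller

original = rebuild_and_return(original)

print(after_rebind)
print(after_mutate)
print(original)
```

pd.concat behaves like the items + [...] line in rebind: it produces a new object, so the result has to be returned (or stored somewhere the caller can reach) to be seen outside the method.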
For an exhaustive discussion, see the Stack Overflow question "How do I pass a variable by reference?"
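Applied to the code in the question, the return-based approach might look like the sketch below. The DataCleaner class body, its constructor, and the sample data are assumptions filled in for illustration, since the question shows only the two methods; the non-inplace drop calls are a deliberate change so the caller's frame is never modified.

```python
import pandas as pd


class DataCleaner:
    """Hypothetical stand-in for the asker's class (only the two shown methods)."""

    def __init__(self):
        self._featuresToBeOneHotEncoded = []

    def addFeatureToBeOneHotEncoded(self, featureName):
        self._featuresToBeOneHotEncoded.append(featureName)

    def applyOneHotEncoding(self, data):
        for feature in self._featuresToBeOneHotEncoded:
            dummies = pd.get_dummies(data[feature])
            # Drop the last dummy column, as in the question, without inplace=True.
            dummies = dummies.drop(columns=dummies.columns[-1])
            # Each step builds a new frame; the caller's frame is left untouched.
            data = pd.concat([data.drop(columns=feature), dummies], axis=1)
        return data  # return after the loop so every feature is encoded


# Made-up sample data, for illustration only.
train_x_categorical = pd.DataFrame(
    {"color": ["red", "blue", "red"], "size": ["S", "M", "L"]}
)

dataCleaner = DataCleaner()
for col in train_x_categorical.columns:
    dataCleaner.addFeatureToBeOneHotEncoded(col)

encoded = dataCleaner.applyOneHotEncoding(train_x_categorical)
print(encoded.columns)
```

In your own code you would assign the result back, train_x_categorical = dataCleaner.applyOneHotEncoding(train_x_categorical). Note also that in the original method the first data.drop(feature, axis=1, inplace=True) still acts on the caller's frame before data is rebound, so the original version removes the first feature column from train_x_categorical without adding its dummies.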
Upvotes: 1