Reputation: 6483
I one-hot encoded a variable and, after some computation, I would like to retrieve the original one.
What I am doing is the following:
I filter the one-hot encoded column names (they all start with the name of the original variable, let's say 'mycol'):
filter_col = [col for col in df if col.startswith('mycol')]
Then I can simply multiply the filtered columns by their column names:
X_test[filter_col]*filter_col
However, this leads to a sparse matrix. How do I create a single variable out of this? Summing doesn't work, as the empty cells are treated as numbers. When I do sum(X_test[filter_col]*filter_col), I get
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Any suggestion on how to proceed? Is this even the best approach or is there some function out there doing exactly what I need?
As requested, here is an example, taken from here:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'mycol': np.random.choice(['panda', 'python', 'shark'], 10),
})
df = pd.get_dummies(df)
Upvotes: 3
Views: 2311
Reputation: 862581
If you need to sum the values per row:
(X_test[filter_col]*filter_col).sum(axis=1)
A solution that also handles rows containing only 0s or multiple 1s:
X_test = pd.DataFrame({
'mycolB':[0,1,1,0],
'mycolC':[0,0,1,0],
'mycolD':[1,0,0,0],
})
filter_col = [col for col in X_test if col.startswith('mycol')]
df = X_test[filter_col].dot(pd.Index(filter_col) + ', ' ).str.strip(', ')
print (df)
0            mycolD
1            mycolB
2    mycolB, mycolC
3
dtype: object
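To see why the dot product yields joined names, it may help to spell out what happens for a single row. The sketch below is plain Python rather than pandas and only illustrates the idea: multiplying a string by 0 or 1 drops or keeps it, and "summing" strings concatenates them. The names cols and row are made up for this illustration.

```python
# Illustration only: one row of the dummy matrix, handled by hand.
cols = ["mycolB", "mycolC", "mycolD"]   # hypothetical column names
row = [1, 1, 0]                         # hypothetical dummy values for one row

# int * str repeats the string: 1 keeps the name, 0 gives an empty string.
pieces = [flag * (name + ", ") for flag, name in zip(row, cols)]

# Concatenate the pieces, then strip the trailing separator.
joined = "".join(pieces).strip(", ")
print(joined)
```

A row of all zeros gives an empty string, which matches the blank entry for row 3 in the output above.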
Upvotes: 2
Reputation: 18647
IIUC, you can use DataFrame.idxmax along axis=1. If necessary, you can remove the dummy prefix with str.replace:
X_test[filter_col].idxmax(axis=1).str.replace('mycol_', '')
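As a minimal end-to-end sketch of this approach, assuming the dummies come from pd.get_dummies (so the prefix is 'mycol_') and every row has exactly one active column. The sample values and the name recovered are made up for illustration, and the astype(int) cast is only a precaution because get_dummies returns boolean columns in recent pandas versions.

```python
import pandas as pd

# Hypothetical sample data; get_dummies creates mycol_panda, mycol_python, mycol_shark.
df = pd.DataFrame({"mycol": ["panda", "python", "shark", "python"]})
dummies = pd.get_dummies(df)

filter_col = [col for col in dummies if col.startswith("mycol")]

# idxmax(axis=1) returns, per row, the name of the column holding the largest value.
recovered = (
    dummies[filter_col]
    .astype(int)                                # precaution: bool -> int
    .idxmax(axis=1)
    .str.replace("mycol_", "", regex=False)     # strip the dummy prefix
)
print(recovered.tolist())
```

Note that for a row containing only zeros, idxmax still returns the first column, so this sketch is only safe when each row is known to have exactly one 1.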
Upvotes: 1