Reputation: 2866
I have a dataset with five columns.
Dataset:
Country Population Tourism Mean_Age Employed
Afghanistan 37172386 14000 17.3 Fulltime
Albania 2866376 5340000 36.2 Parttime
There are almost 1000 rows like this, where Employed
is a categorical column. I want to represent the Employed
column as a numerical column using one-hot encoding.
My code is:
from sklearn.preprocessing import OneHotEncoder
Employed_Status = data["Employed"]
encoder = OneHotEncoder()
encoder.fit(Employed_Status.values.reshape(-1, 1))
encoder.transform(Employed_Status.head().values.reshape(-1, 1)).todense()
Here data
is the name of my data frame.
When I look at the dataset after executing the lines above, it is unchanged.
However, I expected to get something like this:
Country Population Tourism Mean_Age Employed
Afghanistan 37172386 14000 17.3 1
Albania 2866376 5340000 36.2 0
since I have applied one-hot encoding to the Employed
column.
Can anyone tell me why I got the same result and not the desired one?
Upvotes: 0
Views: 927
Reputation: 758
You're not saving the output.
out = encoder.transform(...).todense()
data['employed'] = out
The encoder returns one column per category, so it may take some wrangling to get the datasets to go together. In the past I have needed pd.concat([numerical_in, categorical_encoded_in], axis=1)
(note that the frames are passed as a list), but you might simply find it works once you save the dense output.
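To make the "save the output and join it back" idea concrete, here is a minimal sketch. It assumes a small stand-in frame: the first two rows come from the question, while the third row ("Exampleland") and the Employed_ column prefix are made up for illustration. It is one possible way to do the wrangling, not necessarily the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# First two rows are from the question; the third row is invented for illustration.
data = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Exampleland"],
    "Population": [37172386, 2866376, 1000000],
    "Tourism": [14000, 5340000, 250000],
    "Mean_Age": [17.3, 36.2, 30.0],
    "Employed": ["Fulltime", "Parttime", "Fulltime"],
})

encoder = OneHotEncoder()
# transform/fit_transform return a new sparse matrix; they never modify `data`.
encoded = encoder.fit_transform(data[["Employed"]]).toarray()

# One column per category, named from the categories the encoder learned.
# The "Employed_" prefix is an arbitrary naming choice.
encoded_df = pd.DataFrame(
    encoded,
    columns=[f"Employed_{c}" for c in encoder.categories_[0]],
    index=data.index,
)

# Drop the original text column and attach the encoded columns side by side.
result = pd.concat([data.drop(columns="Employed"), encoded_df], axis=1)
print(result)
```

The key point is that the result has to be assigned to something (here encoded, then result); calling transform on its own leaves the original data frame untouched.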
Upvotes: 0
Reputation: 1103
You can do something like this:
data['Employed'] = data['Employed'].replace('Fulltime',1).replace('Parttime',0)
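A possible variant of the same idea uses a single dictionary with map instead of chained replace calls. This is only a sketch on a two-row stand-in frame built from the question's sample; the name mapping is hypothetical. Note that this produces the single 0/1 column shown in the question's desired output, which is a label mapping rather than a true one-hot encoding.

```python
import pandas as pd

# Two-row stand-in for the question's data frame.
data = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Employed": ["Fulltime", "Parttime"],
})

# Hypothetical name; maps each category to the number wanted in the output.
mapping = {"Fulltime": 1, "Parttime": 0}

# map returns a new Series, so assign it back to the column.
# Any value missing from the mapping would become NaN.
data["Employed"] = data["Employed"].map(mapping)
print(data)
```

Because unmapped values turn into NaN, a quick data["Employed"].isna().sum() afterwards is a cheap way to spot unexpected categories in the full dataset.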
Upvotes: 1