Reputation: 1312
I'm working with the Titanic dataset. It has categorical columns, so I used LabelEncoder to convert the text values to numbers. Before:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 male 22.00 1 0 7.2500 S
1 2 1 1 female 38.00 1 0 71.2833 C
2 3 1 3 female 26.00 0 0 7.9250 S
After:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 1 22.00 1 0 7.2500 2
1 2 1 1 0 38.00 1 0 71.2833 0
2 3 1 3 0 26.00 0 0 7.9250 2
This is the code:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
data['Embarked'] = labelencoder_X.fit_transform(data['Embarked'])
data['Sex'] = labelencoder_X.fit_transform(data['Sex'])
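For reference, LabelEncoder assigns the integer codes in sorted label order, which is why female becomes 0 and male becomes 1 above. A minimal sketch to check this, assuming a made-up three-row frame (`sample` and `encoders` are illustrative names, not part of my real code); note that with only C and S present here, S gets code 1 rather than the 2 it gets in the full data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row sample mirroring the rows shown above
sample = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                       'Embarked': ['S', 'C', 'S']})

# One encoder per column, so each fitted mapping stays available afterwards
encoders = {}
for col in ['Sex', 'Embarked']:
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col])

# classes_[i] is the label that was mapped to integer i
sex_classes = list(encoders['Sex'].classes_)
embarked_classes = list(encoders['Embarked'].classes_)
print(sex_classes)
print(embarked_classes)
```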
Now, since neither sex should count as more important than the other, I want to use OneHotEncoder. As I understand it, the data should look like this:
PassengerId Survived Pclass Male Female Age SibSp Parch Fare Embarked
0 1 0 3 1 0 22.00 1 0 7.2500 2
1 2 1 1 0 1 38.00 1 0 71.2833 0
2 3 1 3 0 1 26.00 0 0 7.9250 2
How can I write code to do this? I tried a similar approach with OneHotEncoder:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
data['Embarked'] = labelencoder_X.fit_transform(data['Embarked'])
data['Sex'] = labelencoder_X.fit_transform(data['Sex'])
onehotencoder = OneHotEncoder()
data['Embarked'] = onehotencoder.fit_transform(data['Embarked'].values.reshape(-1,1))
But it just returns the same result. How can I fix it? I'm new to scikit-learn and ML, so I hope I'm doing things correctly.
Upvotes: 1
Views: 2065
Reputation: 1902
This is how you can do it.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample data (Sex is already label-encoded: 0 = female, 1 = male)
df = pd.DataFrame({'Sex': [1, 0, 0, 1]})
# OneHotEncoder
# Series.reshape was removed from pandas, so reshape the underlying NumPy array
result = OneHotEncoder().fit_transform(df['Sex'].values.reshape(-1, 1)).toarray()
# Appending columns
df[['Female', 'Male']] = pd.DataFrame(result, index = df.index)
# Resulting dataframe
df
Sex Female Male
0 1 0.0 1.0
1 0 1.0 0.0
2 0 1.0 0.0
3 1 0.0 1.0
Upvotes: 2