Reputation: 807
I am new to Python and sklearn. I have a pandas DataFrame of the Titanic dataset, and I want to use it for sklearn logistic regression prediction.
I tried the following:
data_np = data.astype(np.int32).values
but it is not working. I want to make use of different features in the dataset, like 'Pclass', 'Sex', 'Age', etc.
I want to convert the entire DataFrame, as well as single columns such as data["Age"], to the numpy format sklearn expects. Any help is appreciated.
Upvotes: 4
Views: 6464
Reputation: 1197
This is a common problem, and the main reason is lack of familiarity with numpy.
To convert the categorical data in data['Sex'] into a numpy array, use the following code.
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the training labels ('female'/'male')
enc = LabelEncoder()
label_encoder = enc.fit(p_train['Sex'])
print("Categorical classes:", label_encoder.classes_)

integer_classes = label_encoder.transform(label_encoder.classes_)
print("Integer classes:", integer_classes)

# Transform both train and test columns to integer codes
x_train = label_encoder.transform(p_train['Sex'])
x_test = label_encoder.transform(p_test['Sex'])

# Reshape from (n_samples,) to (n_samples, 1) for sklearn estimators
x_train = x_train[:, np.newaxis]
x_test = x_test[:, np.newaxis]
Here, we are basically converting the 'male' and 'female' categorical values into the integer classes 0 and 1. This is essential since sklearn expects everything to be numeric. np.newaxis is used to convert the shape of x_train from (n_samples,) to (n_samples, 1); otherwise, while fitting the model, you will get another error about incompatible shapes.
Upvotes: 3
Reputation: 40149
Categorical variables like 'Sex' and 'Embarked' need to be one-hot encoded before they can be used in a LogisticRegression
model. With pandas you can use get_dummies(data['Sex']).
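As a quick sketch of what get_dummies produces, using a tiny made-up Titanic-style frame (not the real dataset): each category becomes its own 0/1 indicator column.

```python
import pandas as pd

# Toy Titanic-style frame (hypothetical values)
data = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                     'Embarked': ['S', 'C', 'S']})

# One-hot encode: one indicator column per category,
# prefixed with the source column name
dummies = pd.get_dummies(data[['Sex', 'Embarked']])
print(dummies.columns.tolist())
```

The resulting frame can be passed straight to a LogisticRegression as the feature matrix.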
There is a full tutorial that covers this exact issue on the same dataset here:
Upvotes: 4
Reputation: 14498
To process your numerical and non-numerical data, consider using scikit-learn's LabelEncoder, which allows you to
Encode labels with value between 0 and n_classes-1.
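A minimal sketch of that behavior, using made-up 'Embarked'-style labels: classes are sorted, then each label is mapped to its index.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Each distinct label is mapped to an integer in [0, n_classes - 1];
# classes_ holds them in sorted order
codes = le.fit_transform(['S', 'C', 'Q', 'S'])
print(codes)        # integer codes for each label
print(le.classes_)  # the sorted distinct labels
```

inverse_transform can map the integer codes back to the original labels when needed.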
See also:
https://stackoverflow.com/a/29187634/1569064
Upvotes: 2