Seja Nair

Reputation: 807

Convert Pandas Dataframe to numpy for sklearn

I am new to Python and sklearn. I have a pandas DataFrame of the Titanic dataset, and I want to use it for logistic regression in sklearn.

I tried the following:

data_np = data.astype(np.int32).values

But it is not working. I want to make use of different features in the dataset, such as 'Pclass', 'Age', 'Sex', etc.

I want to convert the entire DataFrame, as well as single columns such as data["Age"], to the NumPy format that sklearn expects. Any help is appreciated.

Upvotes: 4

Views: 6464

Answers (3)

user3116355

Reputation: 1197

This is a common problem. The main reason is lack of familiarity with numpy.

To convert the values of data['Sex'] into a NumPy array, use the following code.

import numpy as np
from sklearn.preprocessing import LabelEncoder

# p_train and p_test are the training and test DataFrames
enc = LabelEncoder()
label_encoder = enc.fit(p_train['Sex'])
print("Categorical classes:", label_encoder.classes_)
integer_classes = label_encoder.transform(label_encoder.classes_)
print("Integer classes:", integer_classes)
x_train = label_encoder.transform(p_train['Sex'])
x_test = label_encoder.transform(p_test['Sex'])

# add a second axis so each array has shape (n_samples, 1)
x_train = x_train[:, np.newaxis]
x_test = x_test[:, np.newaxis]

Here, we are basically converting the 'male' and 'female' categorical values into the integer classes 0 and 1. This is essential because sklearn expects every feature to be numeric. np.newaxis is used to change the shape of x_train from (n_samples,) to (n_samples, 1). Otherwise, fitting the model fails with another error about incompatible shapes.
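As a rough illustration of the shape change described above, here is a minimal sketch on a tiny made-up frame. The DataFrame contents and the names sex_codes, sex_column, age_column and X are assumptions for the example, not part of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Titanic training frame
p_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

enc = LabelEncoder()
sex_codes = enc.fit_transform(p_train["Sex"])  # 1-D array, one code per row
sex_column = sex_codes[:, np.newaxis]          # 2-D array, one column

# Double brackets select a one-column DataFrame, so .values is already 2-D
age_column = p_train[["Age"]].values

# Stack the columns side by side into a single feature matrix
X = np.hstack([sex_column, age_column])

print(enc.classes_)
print(sex_codes.shape, sex_column.shape, X.shape)
```

A numeric column such as 'Age' needs no encoding, only the 2-D shape, which is why it is selected with double brackets here.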

Upvotes: 3

ogrisel

Reputation: 40149

Categorical variables like 'Sex' and 'Embarked' need to be one-hot encoded before they can be used in a LogisticRegression model. With pandas you can use pd.get_dummies(data['Sex']).
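A minimal sketch of that approach, assuming a small made-up frame with 'Sex', 'Age' and 'Survived' columns (the data values and the names sex_dummies, features, X, y and model are illustrative only):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical sample rows; the real Titanic frame has many more columns
data = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# One indicator column per category: Sex_female and Sex_male
sex_dummies = pd.get_dummies(data["Sex"], prefix="Sex")

# Combine the numeric column with the indicator columns
features = pd.concat([data[["Age"]], sex_dummies], axis=1)

# Cast to float so the NumPy array has a single numeric dtype
X = features.astype(float).values
y = data["Survived"].values

model = LogisticRegression().fit(X, y)
print(features.columns.tolist())
print(X.shape)
```

The same pattern should extend to 'Embarked' or 'Pclass' by concatenating more get_dummies outputs.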

There is a full tutorial that covers specifically this issue on the same dataset here:

http://nbviewer.ipython.org/github/ogrisel/parallel_ml_tutorial/blob/master/rendered_notebooks/04%20-%20Pandas%20and%20Heterogeneous%20Data%20Modeling.ipynb

Upvotes: 4

AGS

Reputation: 14498

To process your non-numerical data alongside your numerical data, consider using scikit-learn's LabelEncoder, which allows you to

"Encode labels with value between 0 and n_classes-1."

See also:

https://stackoverflow.com/a/29187634/1569064

Upvotes: 2
