Eliza Romanski
Eliza Romanski

Reputation: 93

Cross validation in random forest using anaconda

I'm using the titanic data set to predict if a passenger survived or not using random forest. This is my code:

import numpy as np 
import pandas as pd 
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline

data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables 
x=data[["PClass","Age","Sex"]]
# the target variable is y 
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)

But, I keep on getting this error:

ValueError: could not convert string to float: 'female'

and I don't understand what is the problem because I changed the Sex feature to a dummy

Thanks:)

Upvotes: 0

Views: 98

Answers (1)

jtweeder
jtweeder

Reputation: 759

pd.get_dummies returns a data frame, and does not do the operation in place. Therefore you really are sending a sting with the sex column.

So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']) and this should fix your problem. I think PClass will also be a string column you need to use dummy variables, as you have it filling '3rd'.

There are still some more places where you call data.isnull().any() that is not doing anything to the underlying dataframe. I left those as they were, but just FYI they may not be doing what you intended.

Full code would be:

import numpy as np 
import pandas as pd 
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline

data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()   <-----Beware this is not doing anything to the data
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  <-----Beware this is not doing anything to the data
data1.isnull().any() <-----Beware this is not doing anything to the data

#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])

# choosing the predictive variables 
# x=data[["PClass","Age","Sex"]]
# the target variable is y 
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)

Upvotes: 1

Related Questions