Reputation: 365
Hello I am following a video on Udemy. We are trying to apply a random forest classifier. Before we do so, we convert one of the columns in a data frame into a string. The 'Cabin' column represents values such as "4C" but in order to reduce the number of unique values, we want to use simply the first number to map onto a new column 'Cabin_mapped'.
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
data['Cabin_mapped'].unique(),0)}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
This part below is simply splitting the data into training and test set. The parameters don't really matter for figuring out the problem.
X_train_less_cat, X_test_less_cat, y_train, y_test = \
train_test_split(data[use_cols].fillna(0), data.Survived,
test_size = 0.3, random_state=0)
I get an error here after the fit, saying I could not convert the string into a float. rf = RandomForestClassifier(n_estimators=200, random_state=39) rf.fit(X_train_less_cat, y_train)
It seems like I need to convert one of the inputs back into float to use the random forest algorithms. This is despite the error not showing up in the tutorial video. If anyone could help me out, that'd be great.
Upvotes: 1
Views: 5167
Reputation: 2636
here's fully working example - I've highlighted the bit that you are missing. You need to convert EVERY column to a number, not just "cabin".
!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv
import pandas as pd
data = pd.read_csv("train.csv")
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
data['Cabin_mapped'].unique(),0)}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n,v in data.items():
if v.dtype == "object":
data[n] = v.factorize()[0]
## END of the bit you're missing
use_cols = data.drop("Survived",axis=1).columns
X_train_less_cat, X_test_less_cat, y_train, y_test = \
train_test_split(data[use_cols].fillna(0), data.Survived,
test_size = 0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
Upvotes: 1