NimbleTortoise

Reputation: 365

ValueError: could not convert string to float: 'n'

Hello, I am following a video course on Udemy. We are trying to apply a random forest classifier. Before we do so, we convert one of the columns in the data frame to a string. The 'Cabin' column holds values such as "4C", but to reduce the number of unique values we keep only the first character and map it onto a new column, 'Cabin_mapped'.


data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
    data['Cabin_mapped'].unique(),0)}

data.loc[:,'Cabin_mapped'] =  data.loc[:,'Cabin_mapped'].map(cabin_dict)

data[['Cabin_mapped', 'Cabin']].head() 

This part below is simply splitting the data into training and test set. The parameters don't really matter for figuring out the problem.

X_train_less_cat, X_test_less_cat, y_train, y_test = \
    train_test_split(data[use_cols].fillna(0), data.Survived, 
                     test_size = 0.3, random_state=0) 

I get an error here after the fit, saying I could not convert the string into a float:

rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)

It seems I need to convert one of the inputs to float before I can use the random forest algorithm, even though the error does not show up in the tutorial video. If anyone could help me out, that'd be great.

Upvotes: 1

Views: 5167

Answers (1)

avloss

Reputation: 2636

Here's a fully working example; I've highlighted the bit that you are missing. You need to convert EVERY string column to a number, not just "Cabin".

!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv

import pandas as pd

data = pd.read_csv("train.csv")




data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
    data['Cabin_mapped'].unique(),0)}

data.loc[:,'Cabin_mapped'] =  data.loc[:,'Cabin_mapped'].map(cabin_dict)

data[['Cabin_mapped', 'Cabin']].head()


from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split


## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n,v in data.items():
    if v.dtype == "object":
        data[n] = v.factorize()[0]
## END of the bit you're missing

use_cols = data.drop("Survived",axis=1).columns

X_train_less_cat, X_test_less_cat, y_train, y_test = \
    train_test_split(data[use_cols].fillna(0), data.Survived, 
                    test_size = 0.3, random_state=0) 


rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
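
If you want to see what the conversion loop above does before running it on the real data, here is a minimal sketch on a toy frame. The column values are made up for illustration, and the numeric-dtype check is just one way to list the columns that would still trip up `rf.fit`; it is not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; this is not the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Cabin": ["C85", np.nan, "E46", "C85"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# List the columns that are not numeric yet; these are the ones that
# would cause "could not convert string to float" during fit.
string_cols = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]

# Same idea as the loop above: replace each string column with integer codes.
for name in string_cols:
    toy[name] = pd.factorize(toy[name])[0]

# After the conversion, no non-numeric columns should remain.
remaining = [c for c in toy.columns
             if not pd.api.types.is_numeric_dtype(toy[c])]
```

One thing to keep in mind: `factorize` encodes missing values as -1, so for the converted columns the later `fillna(0)` has nothing left to fill. Only columns that were already numeric (like "Age" here) still contain NaN at that point.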

Upvotes: 1
