Reputation: 7721

Error in scikit code

I am new to machine learning and am trying the Titanic problem from Kaggle. I have written the code below, which fits a scikit-learn decision tree to the data. It raises an error that I have been unable to fix.

Code:

#!/usr/bin/env python

from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import tree


train_uri = './titanic/train.csv' 
test_uri = './titanic/test.csv'

train = pd.read_csv(train_uri)
test = pd.read_csv(test_uri)

# print(train[train["Sex"] == 'female']["Survived"].value_counts(normalize=True))

train['Child'] = float('NaN')
# Use .loc instead of chained indexing so the assignment reliably hits `train`
train.loc[train['Age'] < 18, 'Child'] = 1
train.loc[train['Age'] >= 18, 'Child'] = 0

# print(train[train['Child'] == 1]['Survived'].value_counts(normalize=True))
# print(train['Embarked'][train['Embarked'] == 'C'].value_counts())
# print(train.shape)

## Fill empty 'Embarked' values with 'S'
train['Embarked'] = train['Embarked'].fillna('S')

## Convert Embarked classes to integers
train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2

## Convert Sex to integers
train.loc[train['Sex'] == 'male', 'Sex'] = 0
train.loc[train['Sex'] == 'female', 'Sex'] = 1

target = train['Survived'].values
features_a = train[['Pclass', 'Sex', 'Age', 'Fare']].values

tree_a = tree.DecisionTreeClassifier()

#####  Line With Error ##### 
tree_a = tree_a.fit(features_a, target)

# print(tree_a.feature_importances_)
# print(tree_a.score(features_a, target))

Error:

Traceback (most recent call last):
  File "titanic.py", line 40, in <module>
    tree_a = tree_a.fit(features_a, target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 739, in fit
    X_idx_sorted=X_idx_sorted)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 122, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

This error does not occur when I run the code on the DataCamp server, only when I run it locally. I don't understand why it comes up: I have checked the data, and as far as I can tell neither features_a nor target contains NaN or very large values.

Upvotes: 0

Answers (2)

sansingh

Reputation: 195

You can also try the pandas dropna() function to drop every row of the dataset that contains an invalid value such as NaN.
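A minimal sketch of that idea, assuming the same column names as in the question. The small DataFrame is made-up toy data standing in for train.csv, and feature_cols and clean are names introduced here only for illustration. Passing subset to dropna() limits the check to the columns the model actually uses, so rows are not discarded because of NaN in unrelated columns.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Toy stand-in for the Titanic training data (values are made up for illustration).
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 51.86],
    "Survived": [0, 1, 1, 1, 0, 0],
})

feature_cols = ["Pclass", "Sex", "Age", "Fare"]

# Drop only the rows that have a NaN in a column the model uses.
clean = train.dropna(subset=feature_cols + ["Survived"])

# Build features and target from the same cleaned frame so they stay aligned.
features_a = clean[feature_cols].values
target = clean["Survived"].values

tree_a = tree.DecisionTreeClassifier(random_state=0)
tree_a = tree_a.fit(features_a, target)
print(clean.shape)
```

Note that dropping rows shrinks the training set; filling the gaps instead (for example with fillna() and the column median) is a common alternative when many rows are affected.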

Upvotes: 1

simon

Reputation: 2841

Try each feature one by one and you will probably find that one of them contains nulls. I note that you do not check whether Sex has nulls.
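One way to run that check for all features at once is sketched below, assuming the question's column names. The DataFrame is made-up toy data standing in for train.csv, and nan_counts and cols_with_nan are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training data (values are made up for illustration).
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Sex": [0, 1, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

feature_cols = ["Pclass", "Sex", "Age", "Fare"]

# Count the missing values in each feature column.
nan_counts = train[feature_cols].isnull().sum()
print(nan_counts)

# Keep only the names of the columns that contain at least one NaN.
cols_with_nan = nan_counts[nan_counts > 0].index.tolist()
print(cols_with_nan)
```

Running the same two lines on the real train DataFrame, right before the call to fit, should show which column triggers the ValueError.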

Also, when coding each categorical variable manually it is easy to make an error, for example by misspelling one of the categories. Instead you can use df = pd.get_dummies(df), which automatically encodes all the categorical variables for you, with no need to specify each category manually.
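A short sketch of the get_dummies approach on made-up toy data with the question's column names. Here the columns argument restricts the encoding to Sex and Embarked; that restriction is a choice made for this example, since calling get_dummies on the whole frame would also expand any other text columns it contains.

```python
import pandas as pd

# Toy stand-in for the Titanic training data (values are made up for illustration).
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Fill missing ports first, as in the question, so no row ends up with all-zero dummies.
train["Embarked"] = train["Embarked"].fillna("S")

# One indicator column per category; the original Sex and Embarked columns are replaced.
encoded = pd.get_dummies(train, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
```

Unlike the 0/1/2 integer coding in the question, this produces one indicator column per category, so no artificial ordering between the ports is implied.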

Upvotes: 1
