R user

Reputation: 131

Python - Error with scikit-learn Random Forest about value format

When I execute the command:

clf.fit(train_data, train_label)

I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The problem is in the array train_data, which has shape (18000, 20). I've tried casting it to float32 with this command:

clf.fit(np.float32(train_data), train_label)

or

train_data = np.array([s[0].astype('float32') for s in train_data])

The datasets train_data and train_label are stored in the pickled file train (Python), available at the following link:

https://www.dropbox.com/s/b3017gi18x6x325/train?dl=0

However, neither attempt makes all the values in the array train_data valid for the clf.fit function. Any help?

Upvotes: 0

Views: 569

Answers (1)

seralouk

Reputation: 33127

I found a way around this error. The diagnostics find no NaNs, so the likely cause is values too large for float32. Scaling the data brings them into range:

Code:

from sklearn.ensemble import RandomForestClassifier
import pickle
import numpy as np
from sklearn.preprocessing import scale

with open('train', 'rb') as f:
    train_data, train_label = pickle.load(f)

# Diagnostics: check whether there are any NaNs. None were found.
print(np.isnan(train_data).any())
print(np.where(np.isnan(train_data)))
print(np.isnan(train_label).any())
print(np.where(np.isnan(train_label)))

# No NaNs, so scale the features (zero mean, unit variance per column)
train_data = scale(train_data)

clf = RandomForestClassifier()
clf.fit(train_data, train_label)
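
The error message names three conditions (NaN, infinity, value too large for float32), and the diagnostics above only test for NaN. Here is a minimal sketch that counts all three. The helper name describe_invalid and the small array X are made up for illustration; X is a synthetic stand-in, not the real train_data.

```python
import numpy as np

def describe_invalid(X):
    """Count entries matching each condition named in the ValueError."""
    X = np.asarray(X, dtype=np.float64)
    f32_max = np.finfo(np.float32).max  # largest finite float32 value
    finite = np.isfinite(X)
    return {
        "nan": int(np.isnan(X).sum()),
        "inf": int(np.isinf(X).sum()),
        # finite as float64, but would overflow when cast to float32
        "too_large_for_float32": int((finite & (np.abs(X) > f32_max)).sum()),
    }

# Hypothetical data: one NaN, one inf, one value beyond the float32 range
X = np.array([[1.0, 2.0], [1e40, 3.0], [np.nan, np.inf]])
report = describe_invalid(X)
print(report)

# Casting does not repair an out-of-range value: it overflows to inf
with np.errstate(over="ignore"):
    cast = X.astype(np.float32)
print(np.isinf(cast[1, 0]))
```

If the too_large_for_float32 count is nonzero on the real data, that would explain why the np.float32 casts in the question did not help and why scaling did: standardizing in float64 shrinks the values before the classifier converts them to float32.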

Upvotes: 1
