Python SKLearn fit Value Error Input

Question

I'm trying to fit a preprocessing pipeline to some data and transform it for later use in a classifier, but it always gives me an error and I don't understand why. Can somebody help me?

##stores the function Pipeline with parameters decided above    
inputPipe = getPreProcPipe(normIn=normIn, pca=pca, pcaN=pcaN, whiten=whiten)
print inputPipe
print

#print devData[classTrainFeatures].values.astype('float32')

print devData[classTrainFeatures].shape
print type(devData[classTrainFeatures].values)

##fit pipeline to inputs features and types
inputPipe.fit(devData[classTrainFeatures].values.astype('float32'))

##transform inputs X
X_class = inputPipe.transform(devData[classTrainFeatures].values.astype('float64'))
## Output Y, i.e. 0 or 1, as it is the target
Y_class = devData['gen_target'].values.astype('int')
#print Y_class

Output:

Pipeline(memory=None,
 steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('normPCA', StandardScaler(copy=True, with_mean=True, with_std=True))])

(32583, 2)

Error at the end of the code:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

ralf htp · Accepted Answer

You have to check the data you use (not the code) for NaN (not-a-number) values. NumPy has the function np.isnan() ( https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html ) for this; see also the question "How to get the indices list of all NaN value in numpy array?".

Also check for infinite values with np.isinf().

This Kaggle kernel has example code for filling NaNs and Infs in datasets that are then used in classifiers: https://www.kaggle.com/mknorps/titanic-with-decision-trees . For interpolate(), also see https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas?rq=1 .
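As a rough illustration of the two checks above, here is a minimal sketch on a small made-up array. The array `X` is a hypothetical stand-in for `devData[classTrainFeatures].values`; only `np.isnan`, `np.isinf`, `np.isfinite`, `np.argwhere` and `np.where` are used.

```python
import numpy as np

# Hypothetical stand-in for devData[classTrainFeatures].values
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.inf]])

# (row, column) position of every NaN and of every +/-inf
nan_positions = np.argwhere(np.isnan(X))
inf_positions = np.argwhere(np.isinf(X))

# Indices of rows that contain at least one NaN or +/-inf
bad_rows = np.where(~np.isfinite(X).all(axis=1))[0]

print(nan_positions)
print(inf_positions)
print(bad_rows)
```

If `bad_rows` comes back non-empty for the real data, those are the rows to inspect before calling `fit`.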
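For filling rather than dropping, a minimal sketch with pandas might look like the following. The DataFrame and its column names are made up for illustration, and using the column mean is just one possible choice, not necessarily what the linked kernel does.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.inf, 5.0, 6.0]})

# Treat +/-inf as missing so that one strategy handles both cases
df = df.replace([np.inf, -np.inf], np.nan)

# Option 1: replace each NaN with the mean of its column
filled_mean = df.fillna(df.mean())

# Option 2: linear interpolation between neighbouring valid values
filled_interp = df.interpolate()

print(filled_mean)
print(filled_interp)
```

Which option is appropriate depends on the data: interpolation assumes the row order is meaningful, while a mean fill does not.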

Dropping rows that contain NaNs and Infs is done by:

import numpy as np

# True for every row that has a NaN or +/-Inf in any feature column
bad = ~np.isfinite(devData[classTrainFeatures]).all(axis=1)
indx = devData[bad].index
devData = devData.drop(indx).reset_index(drop=True)

(Get the index of the bad rows, drop those rows using the index, then reset the index of the DataFrame.)
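The same drop can also be sketched with pandas' own `replace` and `dropna`, shown here on a toy frame. The data and the feature names `f1` and `f2` are hypothetical stand-ins for the question's `devData` and `classTrainFeatures`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's devData and classTrainFeatures
devData = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [1.0, 2.0, np.inf, 4.0],
    "gen_target": [0, 1, 0, 1],
})
classTrainFeatures = ["f1", "f2"]

# Mark +/-inf as NaN in the feature columns, then drop incomplete rows
devData[classTrainFeatures] = devData[classTrainFeatures].replace(
    [np.inf, -np.inf], np.nan)
devData = devData.dropna(subset=classTrainFeatures).reset_index(drop=True)

# Sanity check before handing the array to a pipeline's fit()
X = devData[classTrainFeatures].values.astype("float32")
assert np.isfinite(X).all()
print(devData)
```

Because whole rows are dropped from `devData`, the target column stays aligned with the features.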
