Sara Marques

Reputation: 33

Python SKLearn fit Value Error Input

I'm trying to fit and transform some data to use later as input to a classifier, but it always gives me an error and I don't understand why. Can somebody please help me?

##stores the function Pipeline with parameters decided above    
inputPipe = getPreProcPipe(normIn=normIn, pca=pca, pcaN=pcaN, whiten=whiten)
print inputPipe
print

#print devData[classTrainFeatures].values.astype('float32')

print devData[classTrainFeatures].shape
print type(devData[classTrainFeatures].values)

##fit pipeline to inputs features and types
inputPipe.fit(devData[classTrainFeatures].values.astype('float32'))

##transform inputs X
X_class = inputPipe.transform(devData[classTrainFeatures].values.astype(double))
## Output Y, i.e, 0 or 1 as it is the target
Y_class = devData['gen_target'].values.astype('int')
#print Y_class

Output:

Pipeline(memory=None,
 steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('normPCA', StandardScaler(copy=True, with_mean=True, with_std=True))])

(32583, 2)
<type 'numpy.ndarray'>

Error at the end of the code:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Upvotes: 2

Views: 2285

Answers (2)

ralf htp

Reputation: 9422

You have to check the data you use (not the code) for NaN (not-a-number) values. NumPy provides the function numpy.isnan() for this ( https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html ); see also "How to get the indices list of all NaN value in numpy array?"

Also check for infinite values with numpy.isinf().
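A minimal sketch of both checks, assuming the features are available as a NumPy float array. The array `X` below is made-up illustration data standing in for `devData[classTrainFeatures].values`; it also shows a value that is finite as float64 but overflows to Inf once cast to float32, which matches the "too large for dtype('float32')" part of the error message.

```python
import numpy as np

# Hypothetical feature matrix (illustration only, not the asker's data)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.inf],
              [1e40, 6.0]])

# Row indices that contain at least one NaN / at least one Inf
nan_rows = np.where(np.isnan(X).any(axis=1))[0]
inf_rows = np.where(np.isinf(X).any(axis=1))[0]

# 1e40 is finite in float64 but exceeds the float32 range,
# so it becomes Inf after the cast
with np.errstate(over="ignore"):
    overflow_rows = np.where(np.isinf(X.astype("float32")).any(axis=1))[0]

print(nan_rows, inf_rows, overflow_rows)
```

If `overflow_rows` lists rows that `inf_rows` does not, the problem is the magnitude of the values rather than missing data.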

This Kaggle kernel has example code for filling NaNs and Infs in datasets that are then used in classifiers: https://www.kaggle.com/mknorps/titanic-with-decision-trees . For interpolate(), also see https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas?rq=1

Dropping rows that contain NaNs and Infs is done by

# index labels of rows where any feature is NaN or +/-Inf
indx = devData.index[~np.isfinite(devData[classTrainFeatures]).all(axis=1)]
devData = devData.drop(indx).copy()
devData = devData.reset_index(drop=True)

(get the index of the rows containing NaN or Inf, drop those rows using the index, then reset the index of the dataframe)

Upvotes: 3

Gabriel M

Reputation: 1514

I see 3 possibilities for this kind of error:

  1. You may have Infs in your data. In that case you may need to remove those samples. To find the Infs, try df.index[np.isinf(df).any(axis=1)]
  2. You may have NaNs in your data. Check it using df.index[np.isnan(df).any(axis=1)]. In that case you may replace the NaNs with the mean value of the column using df.fillna(df.mean()).dropna(axis=1, how='all') .
  3. Finally, and most likely, you have a constant or almost constant feature that, once it is normalized and divided by its standard deviation, gives you NaNs or Infs. In that case you should drop that feature using VarianceThreshold
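The three cases above can be sketched together as follows. This is only an illustration under stated assumptions: the DataFrame `df` and its column names are made up, and the threshold of 0.0 (scikit-learn's default, which removes only exactly constant features) would need to be raised to catch "almost constant" ones.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: "a" has a NaN, "b" has an Inf, "c" is constant
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [0.5, 1.5, np.inf, 2.5],
    "c": [7.0, 7.0, 7.0, 7.0],
})

# 1. Drop the samples (rows) that contain +/-Inf
inf_rows = df.index[np.isinf(df).any(axis=1)]
df = df.drop(inf_rows)

# 2. Replace the remaining NaNs with the column mean
df = df.fillna(df.mean())

# 3. Remove zero-variance features before scaling
selector = VarianceThreshold(threshold=0.0)
X = selector.fit_transform(df.values)
kept = df.columns[selector.get_support()].tolist()

print(kept, X.shape)
```

Dropping the Inf rows before computing the mean matters here, since an Inf left in a column would make that column's mean infinite as well.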

Upvotes: 1
