Reputation: 33
I'm trying to fit a preprocessing pipeline and transform some data for later use in a classifier, but it always raises an error and I don't understand why. Can somebody please help?
##stores the function Pipeline with parameters decided above
inputPipe = getPreProcPipe(normIn=normIn, pca=pca, pcaN=pcaN, whiten=whiten)
print inputPipe
print
#print devData[classTrainFeatures].values.astype('float32')
print devData[classTrainFeatures].shape
print type(devData[classTrainFeatures].values)
##fit pipeline to inputs features and types
inputPipe.fit(devData[classTrainFeatures].values.astype('float32'))
##transform inputs X
X_class = inputPipe.transform(devData[classTrainFeatures].values.astype('float32'))
## Output Y, i.e, 0 or 1 as it is the target
Y_class = devData['gen_target'].values.astype('int')
#print Y_class
Output:
Pipeline(memory=None,
steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('normPCA', StandardScaler(copy=True, with_mean=True, with_std=True))])
(32583, 2)
<type 'numpy.ndarray'>
Error at the end of the output:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Upvotes: 2
Views: 2285
Reputation: 9422
You have to check the data you use, not the code: the error means the input contains NaN (not-a-number) or infinite values. NumPy provides np.isnan() for finding NaNs
( https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html ); see also the question "How to get the indices list of all NaN value in numpy array?".
Also check for infinite values with np.isinf().
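As a minimal sketch of that check, the snippet below uses a small made-up array (`features` is a hypothetical stand-in for `devData[classTrainFeatures].values`) and runs the test after the cast to float32, because a value that is finite in float64 can overflow to inf once it is cast:

```python
import numpy as np

# Hypothetical stand-in for devData[classTrainFeatures].values
features = np.array([[1.0, 2.0],
                     [np.nan, 3.0],
                     [4.0, np.inf],
                     [5.0, 1e40]])  # 1e40 is finite in float64 but too large for float32

# Cast the same way the question does; silence the overflow warning newer NumPy emits
with np.errstate(over="ignore"):
    features32 = features.astype("float32")

# Row positions that contain at least one NaN / at least one inf after the cast
nan_rows = np.where(np.isnan(features32).any(axis=1))[0]
inf_rows = np.where(np.isinf(features32).any(axis=1))[0]

# Rows that only became inf because of the cast to float32
overflow_rows = np.where((np.isinf(features32) & np.isfinite(features)).any(axis=1))[0]

print(nan_rows, inf_rows, overflow_rows)
```

If all three arrays come back empty for the real data, the problem is probably somewhere else in the pipeline.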
This Kaggle kernel has example code for filling NaNs and Infs in datasets that are then used in classifiers: https://www.kaggle.com/mknorps/titanic-with-decision-trees . For interpolate(), also see https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas?rq=1
Dropping the rows that contain NaNs or Infs can be done with
# index labels of the rows where at least one feature is NaN or infinite
indx = devData.index[~np.isfinite(devData[classTrainFeatures]).all(axis=1)]
devData = devData.drop(indx).copy()
devData = devData.reset_index(drop=True)
( get the index of the rows containing NaN or Inf, drop those rows using the index, reset the index of the dataframe )
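The same clean-up can also be written with a boolean mask instead of drop(). This is only a sketch on invented data: the column names `feat_a` and `feat_b` are hypothetical, and only `gen_target` is taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data that only mimics the question's setup
devData = pd.DataFrame({
    "feat_a": [1.0, np.nan, 3.0, 4.0],
    "feat_b": [0.5, 1.5, np.inf, 2.5],
    "gen_target": [0, 1, 0, 1],
})
classTrainFeatures = ["feat_a", "feat_b"]

# True for rows in which every feature is a finite number (neither NaN nor +/-inf)
finite_rows = np.isfinite(devData[classTrainFeatures]).all(axis=1)

# Keep only those rows and renumber the index from 0
cleaned = devData[finite_rows].reset_index(drop=True)
print(cleaned)
```

Because the mask is applied to the whole dataframe, the target column stays aligned with the features that survive.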
Upvotes: 3
Reputation: 1514
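I see 3 possibilities for this kind of error:
The data contains infinite values. The affected rows can be found with df.index[np.isinf(df).any(axis=1)]
The data contains NaN values. The affected rows can be found with df.index[np.isnan(df).any(axis=1)]
In that case you may replace the NaNs with the mean value of each column by doing df.fillna(df.mean()).dropna(axis=1, how='all')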
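A small sketch of the mean-filling approach on a made-up dataframe follows. The replace() step that turns infinities into NaN first is an addition of this example, not part of the suggestion above; without it an inf would make the column mean infinite.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has one NaN, "b" is entirely NaN, "c" has one inf
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [2.0, np.inf, 4.0],
})

# Assumption of this sketch: treat +/-inf as missing so it does not distort the mean
df = df.replace([np.inf, -np.inf], np.nan)

# Fill each NaN with its column mean; a column that is all NaN has no mean,
# stays NaN, and is then removed by dropna(axis=1, how="all")
filled = df.fillna(df.mean()).dropna(axis=1, how="all")
print(filled)
```

Column "b" disappears because it has no values to average, which is what the trailing dropna(axis=1, how='all') is for.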
Upvotes: 1