Reputation: 463
I have gone through all the similar questions, but none of them answers my query. I am using a random forest classifier as follows:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
clf.predict(X_test)
It's giving me this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
However, when I do X_train.describe()
I don't see any missing values. In fact, I had already handled the missing values before splitting my data.
When I do the following:
np.where(X_train.values >= np.finfo(np.float32).max)
I get:
(array([], dtype=int64), array([], dtype=int64))
And for these commands:
np.any(np.isnan(X_train)) #true
np.all(np.isfinite(X_train)) #false
And after getting the above results, I also tried this:
X_train.fillna(X_train.mean())
but I still get the same error.
Please tell me where I'm going wrong. Thank you!
Upvotes: 5
Views: 12608
Reputation: 100
Solution
X_train = X_train.fillna(X_train.mean())
Explanation
np.any(np.isnan(X_train)) evaluates to True, therefore X_train contains some NaN values.
Per the pandas fillna() docs, DataFrame.fillna() returns a new DataFrame with the missing values filled rather than modifying the original (unless inplace=True is passed). You must reassign X_train to the return value of fillna(), like X_train = X_train.fillna(X_train.mean()).
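To see which columns actually hold the NaN values before filling them, a check along these lines may help. This is only a sketch: the small DataFrame and its column names are made up as a stand-in for X_train.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train; the column names are invented for illustration.
X_train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50000.0, 62000.0, np.nan],
    "score": [0.1, 0.5, 0.9],
})

# Count missing values per column. describe() only hints at them
# through a lower number in its "count" row.
nan_counts = X_train.isna().sum()
print(nan_counts[nan_counts > 0])

# fillna() returns a new DataFrame, so the result is assigned back.
X_train = X_train.fillna(X_train.mean())

# Total number of NaN values left after filling.
remaining = int(X_train.isna().sum().sum())
print(remaining)
```

The same check is worth running on X_test as well, since predict() validates its input too.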
Example
>>> import pandas as pd
>>> import numpy as np
>>>
>>> a = pd.DataFrame(np.arange(25).reshape(5, 5))
>>> a[2] = a[2].astype(float)  # NaN requires a float column
>>> a.loc[2, 2] = np.nan  # .loc avoids chained assignment
>>>
>>> a
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   NaN  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>>
>>> a.fillna(1)
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   1.0  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>>
>>> a
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   NaN  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>>
>>> a = a.fillna(1)
>>> a
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   1.0  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>>
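One more thing that may be worth checking: the error message also mentions infinity, and fillna() does not touch inf values. The question does not show whether X_train contains any, so the following is only a sketch under that assumption, using a made-up frame X in place of the real data. It converts inf to NaN first and then fills as above.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing both NaN and infinite values.
X = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [np.nan, 2.0, -np.inf],
})

# Turn +/-inf into NaN so that a single fillna() call handles everything.
X_clean = X.replace([np.inf, -np.inf], np.nan)

# Fill with column means computed after the inf values were removed,
# and assign the result back.
X_clean = X_clean.fillna(X_clean.mean())

# Same kind of check as in the question: every value should now be finite.
all_finite = bool(np.isfinite(X_clean.to_numpy()).all())
print(all_finite)
```

If this check still fails on the real data, looking at the column dtypes (X_train.dtypes) would be a reasonable next step.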
Upvotes: 1