Lara Farron

Reputation: 75

Model prediction returns warnings

I'm trying to build a model that predicts the probability for an athlete to win a medal. I have a dataframe that looks like this:

(screenshot of the dataframe, not preserved)

Here is what I've done so far:

#Cleaning df
    #Replace NaN with mean or average
    
df['Height'].fillna(value=df['Height'].mean(), inplace=True)
df['Weight'].fillna(value=df['Weight'].mean(), inplace=True)

    #Changing type to integer
df.Height = df.Height.astype(int)
df.Weight = df.Weight.astype(int)

#Target variable
y= df["Medal"]

#If Male =0, if female = 1
df['Sex'] = df['Sex'].apply(lambda x: 1 if str(x) != 'M' else 0)

#Predictive
feature_names = ["Age", "Sex", "Height", "Weight"]
X= df[feature_names]

#Regressor

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)

But when I run the code, I get this warning for every fold:

warnings.warn("Estimator fit failed. The score on this train-test"
C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:610: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1247, in fit
    super().fit(
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
    X, y = self._validate_data(X, y,
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 663, in check_array
    _assert_all_finite(array,
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 103, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

And cross_val_score returns an array like this: array[NaN, NaN, NaN...]

My X looks like this:

         Age  Sex  Height  Weight
0       24.0    1     180      80
1       23.0    1     170      60
2       24.0    1     175      70
3       34.0    1     175      70
4       21.0    1     185      82
...      ...  ...     ...     ...
271111  29.0    1     179      89
271112  27.0    1     176      59
271113  27.0    1     176      59
271114  30.0    1     185      96
271115  34.0    1     185      96

And my y:

0         0
1         0
2         0
3         1
4         0
         ..
271111    0
271112    0
271113    0
271114    0
271115    0
Name: Medal, Length: 271116, dtype: int64

Upvotes: 0

Views: 71

Answers (1)

Corralien

Reputation: 120429

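You filled the missing values for "Height" and "Weight", but not for "Age". It is the only feature in X that can still contain NaN (note that it is still a float column), which is what the ValueError is complaining about. Apply the same operation to the "Age" feature.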

First locate your missing values in this column:

>>> df.loc[df['Age'].isna(), ['ID', 'Name', 'Age']]
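
To see how many values are missing in each feature rather than listing the rows, a per-column count can help. This is a minimal sketch on a made-up toy dataframe; the values are invented for illustration and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's dataframe (values are made up).
df = pd.DataFrame({
    "Age": [24.0, np.nan, 24.0, 34.0],
    "Sex": [1, 1, 0, 1],
    "Height": [180, 170, 175, 175],
    "Weight": [80, 60, 70, 70],
})

feature_names = ["Age", "Sex", "Height", "Weight"]

# Number of missing values in each feature column.
missing_per_column = df[feature_names].isna().sum()
print(missing_per_column)
```

Any column with a non-zero count will make the estimator's input validation fail in the way shown in the traceback.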

If you have few missing values, you can fill with the mean:

>>> df['Age'] = df['Age'].fillna(df['Age'].mean())

But if you have many missing values, filling them with the global mean is probably not a good idea. "Age" can depend on "Country", "Sport", "Year", or even "Season" (winter or summer). The same holds for Height and Weight: the average height in volleyball is probably not the same as in archery.
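
One way to act on this is to fill each missing value with the mean of its own group instead of the global mean. The sketch below groups by "Sport" only and uses invented toy values; which columns to group by in the real dataset is an assumption you would need to check against your data.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two sports with different typical ages.
df = pd.DataFrame({
    "Sport": ["Volleyball", "Volleyball", "Volleyball",
              "Archery", "Archery", "Archery"],
    "Age": [22.0, 26.0, np.nan, 30.0, np.nan, 34.0],
})

# Mean age of each row's sport, aligned to the original index
# (NaN values are skipped when computing the mean).
group_mean = df.groupby("Sport")["Age"].transform("mean")

# Fill each missing age with the mean of its own sport.
df["Age"] = df["Age"].fillna(group_mean)

# Fallback: a group with no known ages at all would still be NaN,
# so fill whatever is left with the mean of the column as it now stands.
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df)
```

The same pattern applies to "Height" and "Weight", and `groupby` also accepts a list of columns if one grouping variable is too coarse.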

Upvotes: 1
