Reputation: 75
I'm trying to build a model that predicts the probability that an athlete wins a medal. I have a dataframe that looks like this:
Here is what I've already done:
#Cleaning df
#Replace NaN with mean or average
df['Height'].fillna(value=df['Height'].mean(), inplace=True)
df['Weight'].fillna(value=df['Weight'].mean(), inplace=True)
#Changing type to integer
df.Height = df.Height.astype(int)
df.Weight = df.Weight.astype(int)
#Target variable
y= df["Medal"]
#If Male =0, if female = 1
df['Sex'] = df['Sex'].apply(lambda x: 1 if str(x) != 'M' else 0)
#Predictive
feature_names = ["Age", "Sex", "Height", "Weight"]
X= df[feature_names]
#Regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)
But when I run the code, it raises this warning and error:
warnings.warn("Estimator fit failed. The score on this train-test"
C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:610: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1247, in fit
super().fit(
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
X, y = self._validate_data(X, y,
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
X = check_array(X, **check_X_params)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 663, in check_array
_assert_all_finite(array,
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 103, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
It also returns an array like this: array[NaN, NaN, NaN...]
My X looks like this:
Age Sex Height Weight
0 24.0 1 180 80
1 23.0 1 170 60
2 24.0 1 175 70
3 34.0 1 175 70
4 21.0 1 185 82
... ... ... ... ...
271111 29.0 1 179 89
271112 27.0 1 176 59
271113 27.0 1 176 59
271114 30.0 1 185 96
271115 34.0 1 185 96
And my y:
0 0
1 0
2 0
3 1
4 0
..
271111 0
271112 0
271113 0
271114 0
271115 0
Name: Medal, Length: 271116, dtype: int64
Upvotes: 0
Views: 71
Reputation: 120429
You fill the missing values for "Height" and "Weight", but not for "Age". The NaN values left in that column are the likely cause of the ValueError, so apply the same operation to "Age".
First, locate the missing values in this column:
>>> df.loc[df['Age'].isna(), ['ID', 'Name', 'Age']]
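To see at a glance which feature columns still contain NaN, it can help to count the missing values per column. The snippet below is a minimal sketch on a small made-up dataframe (the values are hypothetical and only stand in for the real data); the column names are the ones from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
df = pd.DataFrame({
    "Age": [24.0, np.nan, 21.0, 34.0],
    "Sex": [1, 0, 1, 0],
    "Height": [180, 170, 185, 175],
    "Weight": [80, 60, 82, 70],
})
feature_names = ["Age", "Sex", "Height", "Weight"]

# Number of missing values in each feature column
nan_counts = df[feature_names].isna().sum()
print(nan_counts)

# Columns that still need to be filled before fitting the model
columns_with_nan = nan_counts[nan_counts > 0].index.tolist()
print(columns_with_nan)
```

Any column listed in `columns_with_nan` would make scikit-learn's input validation fail in the same way as in the traceback.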
If you have few missing values, you can fill with the mean:
>>> df['Age'] = df['Age'].fillna(df['Age'].mean())
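One way to check that filling "Age" is enough is to confirm that X has no NaN left and then rerun the cross-validation. This is only a sketch: the data is randomly generated (an assumption, not the real dataset), so the scores themselves are meaningless; the point is that they are no longer NaN.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical random data with the same columns as in the question
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 40, n).astype(float),
    "Sex": rng.integers(0, 2, n),
    "Height": rng.integers(150, 210, n),
    "Weight": rng.integers(45, 120, n),
    "Medal": rng.integers(0, 2, n),
})
# Knock out every tenth age to imitate the missing values
df.loc[df.index[::10], "Age"] = np.nan

# Fill the missing ages with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

X = df[["Age", "Sex", "Height", "Weight"]]
y = df["Medal"]

# No NaN should remain in the features at this point
assert not X.isna().any().any()

scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=10)
print(scores)
```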
But if you have many missing values, filling them with the global mean is probably not a good idea. "Age" can depend on "Country", "Sport", "Year", and even "Season" (winter or summer). In fact, the same is true for Height/Weight: the average height in volleyball is probably not the same as in archery...
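A group-wise fill along these lines could look like the sketch below. It assumes the dataframe has a "Sport" column and groups by "Sport" and "Sex"; the grouping keys and the toy values are illustrative choices, not something taken from the real data. Groups with no known value at all fall back to the global mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Sport" is assumed to exist in the real dataframe
df = pd.DataFrame({
    "Sport": ["Volleyball", "Volleyball", "Volleyball", "Archery", "Archery", "Judo"],
    "Sex": [0, 0, 0, 1, 1, 0],
    "Height": [200.0, 196.0, np.nan, 170.0, np.nan, np.nan],
})

# Global mean of the known heights, used as a fallback
global_mean = df["Height"].mean()

# Mean height of each (Sport, Sex) group, aligned row by row with df
group_mean = df.groupby(["Sport", "Sex"])["Height"].transform("mean")

# Fill with the group mean first, then with the global mean
# for groups where every height is missing (Judo in this toy data)
df["Height"] = df["Height"].fillna(group_mean).fillna(global_mean)
print(df)
```

The same pattern applies to "Age" and "Weight"; which columns to group by depends on how many rows each group has left.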
Upvotes: 1