Reputation: 115
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold

# Try every possible number of selected features, from 1 up to all of them
param_grid = {'select__k': np.arange(1, data_x_numeric.shape[1] + 1)}
cv = KFold(n_splits=3, random_state=1, shuffle=True)
gcv = GridSearchCV(pipe, param_grid, return_train_score=True, cv=cv)
gcv.fit(data_x, data_y)

# Rank the parameter settings by mean test score and hide the timing columns
results = pd.DataFrame(gcv.cv_results_).sort_values(by='mean_test_score', ascending=False)
results.loc[:, ~results.columns.str.endswith("_time")]
After running the above code I get a warning advising that estimator fit failed.
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
"pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
"pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
"memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
"pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
"base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
"univariate_selection.py", line 353, in fit
    score_func_ret = self.score_func(X, y)
"<ipython-input-413-f8e48283bbee>", line 7, in fit_and_score_features
    m.fit(Xj, y)
"coxph.py", line 426, in fit
    delta = solve(optimizer.hessian, optimizer.gradient,
"basic.py", line 214, in solve
    _solve_check(n, info)
"basic.py", line 29, in _solve_check
    raise LinAlgError('Matrix is singular.')
numpy.linalg.LinAlgError: Matrix is singular.
warnings.warn("Estimator fit failed. The score on this train-test"
"categorical.py:2630": FutureWarning: The `inplace` parameter in pandas.Categorical.set_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
res = method(*args, **kwargs)
I get this warning multiple times, and the code continues to run for more than 30 minutes. I have removed the file paths from much of the warning output, which is why it may look different from the usual format. The above warning is produced multiple times for this block of code.
I am following the Scikit-Survival documentation and am stuck at this point. Some of the additional code below may help with diagnosing the error, but I am not sure what is causing it.
data_x is a Pandas dataframe with the following data types
data_x.dtypes.astype(str)
f1 category
f2 category
f3 category
f4 float64
f5 category
f6 category
f7 category
f8 category
f9 category
f10 category
f11 category
f12 category
f13 int64
f14 category
f15 category
f16 category
f17 category
f18 category
f19 category
f20 category
f21 int64
dtype: object
data_y is a numpy array
data_y
array([( True, 481.), ( True, 424.), ( True, 519.), ..., ( True, 13.),
( True, 96.), ( True, 6.)],
dtype=[('event', '?'), ('duration', '<f8')])
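data_x_numeric is the new dataframe, one-hot encoded so that it can be used for prediction.
# OneHotEncoder here is the scikit-survival encoder, which returns a DataFrame
from sksurv.preprocessing import OneHotEncoder
data_x_numeric = OneHotEncoder().fit_transform(data_x)
data_x_numeric.head()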
I also obtained individual c-index scores for each feature.
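import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis

def fit_and_score_features(X, y):
    # Fit a single-feature Cox model per column and record its c-index
    n_features = X.shape[1]
    scores = np.empty(n_features)
    m = CoxPHSurvivalAnalysis()
    for j in range(n_features):
        Xj = X[:, j:j+1]
        m.fit(Xj, y)
        scores[j] = m.score(Xj, y)
    return scores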
scores = fit_and_score_features(data_x_numeric.values, data_y)
pd.Series(scores, index=data_x_numeric.columns).sort_values(ascending=False)
f1 0.631355
f2 0.564762
f3 0.564288
f4 0.554376
f5 0.549956
...
f94 0.498701
f95 0.498413
f96 0.483840
f97 0.460941
f98 0.460898
I then created a pipeline.
# Create the pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
pipe = Pipeline([('encode', OneHotEncoder()),
                 ('select', SelectKBest(fit_and_score_features, k=3)),
                 ('model', CoxPHSurvivalAnalysis())])
This is the point where I applied the code from the beginning of the post in order to select my best features to maximize the overall c-index score. I am not quite sure what is going on and would greatly appreciate any help.
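One way to narrow this down, offered as a sketch and not a confirmed diagnosis: a one-hot column that takes a single value in a training fold gives a single-feature Cox model nothing to fit, and its Hessian is then zero, which would be consistent with the "Matrix is singular" error. The helper below (`constant_columns_per_fold` is a hypothetical name, and its shuffle is a stand-in for `KFold`, not the same split) lists such columns per fold using NumPy only.

```python
import numpy as np

def constant_columns_per_fold(X, n_splits=3, seed=1):
    """Return {fold index: column indices that are constant in that training fold}."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    # Shuffle row indices and cut them into folds (illustrative, not KFold's exact split).
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    report = {}
    for i, test_idx in enumerate(folds):
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_idx] = False
        X_train = X[train_mask]
        # A column whose minimum equals its maximum has no variation in this fold.
        constant = np.flatnonzero(X_train.min(axis=0) == X_train.max(axis=0))
        report[i] = constant.tolist()
    return report

# Toy data: column 1 is non-zero for one row only, so it becomes all zeros
# in the training fold that holds that row out.
X_demo = np.array([[0.5, 0, 1],
                   [1.5, 0, 0],
                   [2.5, 1, 1],
                   [3.5, 0, 0],
                   [4.5, 0, 1],
                   [5.5, 0, 0]])
report = constant_columns_per_fold(X_demo, n_splits=3, seed=1)
print(report)
```

On the real data this would be called as `constant_columns_per_fold(data_x_numeric.values)`. If rare categories show up in the report, merging them or dropping those columns before feature scoring may be worth trying.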
Upvotes: 2
Views: 31235
Reputation: 1
I resolved this issue by changing the 'penalty' term. I had 'elasticnet' as the penalty for my Logistic Regression model, which was driving some of the coefficients to 0. The trick is to use a penalty that does not set coefficients to exactly 0.
Upvotes: 0
Reputation: 1
Check for NaN values present in the dataset. I had the same error and it got resolved after replacing them.
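A minimal sketch of that check with pandas, assuming a small made-up frame in place of the real data (the column names and the fill strategy are illustrative assumptions, not something taken from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real feature frame.
df = pd.DataFrame({
    "f4": [1.0, np.nan, 3.0, 4.0],
    "f13": [10, 20, 30, 40],
    "f1": pd.Categorical(["a", None, "b", "a"]),
})

# Count missing entries per column and keep only the columns that have any.
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)

# One possible replacement strategy: the median for numeric columns and
# the most frequent category for categorical ones.
filled = df.copy()
filled["f4"] = filled["f4"].fillna(filled["f4"].median())
filled["f1"] = filled["f1"].fillna(filled["f1"].mode().iloc[0])
```

Whether median or mode imputation is appropriate depends on the data; the point is only to find the NaN values before fitting.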
Upvotes: 0
Reputation: 1
For me the error arose because my parameter grid contained max_features: [1, 3, 5, 7], but my data has only 6 features, so the fit failed for the value 7. After I removed 7 and left max_features: [1, 3, 5], my code ran without errors.
So I would suggest that everybody check the hyperparameters they are passing before running a randomized search CV.
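A quick pre-flight check along those lines might look like the sketch below. The helper name `filter_param_values` is made up for illustration, and it assumes the candidate values are plain integers (some estimators also accept floats or strings for max_features, which this does not handle).

```python
def filter_param_values(values, n_features):
    """Split integer candidates into those that fit the data and those that do not."""
    valid = [v for v in values if 1 <= v <= n_features]
    rejected = [v for v in values if v not in valid]
    return valid, rejected

n_features = 6  # number of columns in the training data
param_grid = {"max_features": [1, 3, 5, 7]}

valid, rejected = filter_param_values(param_grid["max_features"], n_features)
param_grid["max_features"] = valid
print("kept:", valid, "dropped:", rejected)
```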
Upvotes: 0
Reputation: 11
Check for missing data. I had the same error. The program ran fine once I deleted the rows with empty cells.
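If you delete rows, the target has to lose the same rows, otherwise X and y no longer line up. A small sketch, assuming a made-up feature frame and a structured target array shaped like the data_y in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical features and a structured survival target (event, duration).
X = pd.DataFrame({"f4": [1.0, np.nan, 3.0], "f13": [10, 20, 30]})
y = np.array([(True, 481.0), (True, 424.0), (False, 519.0)],
             dtype=[("event", "?"), ("duration", "<f8")])

# Boolean mask of rows with no missing values, applied to both X and y.
keep = X.notna().all(axis=1).to_numpy()
X_clean = X.loc[keep].reset_index(drop=True)
y_clean = y[keep]
```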
Upvotes: 1