Reputation: 9
There are several posts about similar questions, but none has the answer I need, so I am making this new post to look for one.
The question is: why do all the scores come back as NaN?
The data seems fine.
The setup seems fine.
What could be causing this???
Here is my data frame:
df.describe(include = "all")
          TRXN_MONTH  TRANSACTION_AMOUNT
count  598565.000000        5.985650e+05
mean        6.410199        2.457275e+07
std         3.446896        2.732986e+08
min         1.000000        2.000000e-02
25%         3.000000        1.823501e+04
50%         6.000000        1.649049e+05
75%         9.000000        1.318875e+06
max        12.000000        1.694837e+10
Variable data types:
df.dtypes
TRXN_MONTH              int64
TRANSACTION_AMOUNT    float64
dtype: object
Here is my code:
from sklearn import model_selection
from sklearn.ensemble import IsolationForest, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

# Ensemble by stacking
estimator_list = [
    ('lof', LocalOutlierFactor(novelty=False, n_neighbors=20, contamination='auto')),
    ('iforest', IsolationForest(n_estimators=100, contamination='auto'))
]
ensemble = StackingClassifier(estimators=estimator_list, final_estimator=LogisticRegression(), cv=5)
# Set the number of folds and shuffle the samples before splitting
kf = model_selection.KFold(n_splits=10, random_state=10, shuffle=True)
# Evaluate model using cross-validation
ensemble_cross_vald = model_selection.cross_val_score(
    ensemble,
    df_train[['TRXN_MONTH']].values,
    df_train[['TRANSACTION_AMOUNT']].values,
    cv=kf, n_jobs=-1, scoring='recall'
)
ensemble_cross_vald
Here is the output after running the code:
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
I am able to use each model to fit and predict:
y_pred = lof.fit_predict(df)
lofs_index = where(y_pred==-1)
lofs_index
(array([ 17, 43, 61, ..., 598553, 598561, 598562]),)
y_pred = iforest.fit_predict(df)
lofs_index = where(y_pred==-1)
lofs_index
(array([ 6, 14, 15, ..., 598549, 598556, 598561]),)
Upvotes: 0
Views: 873
Reputation: 60390
None of the base models you use for stacking here is a classifier: both are outlier/anomaly detection algorithms. They are not even supervised models, and they do not use the labels y at all, as you can see from their respective fit documentation:
For LOF:
fit(X, y=None)
y : Ignored
Not used, present for API consistency by convention.
For Isolation Forest:
fit(X, y=None, sample_weight=None)
y : Ignored
Not used, present for API consistency by convention.
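One way to check the "y : Ignored" note empirically is to fit each detector twice on the same data with contradictory labels and compare the predictions. This is only a minimal sketch on synthetic data: the arrays X, y_a and y_b and the fixed random_state are assumptions made for illustration and do not come from the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))        # synthetic features (assumption)
y_a = rng.integers(0, 2, size=300)   # an arbitrary label vector (assumption)
y_b = 1 - y_a                        # the exact opposite labels

# Isolation Forest: same seed, different y passed to fit
pred_if_a = IsolationForest(random_state=0).fit(X, y_a).predict(X)
pred_if_b = IsolationForest(random_state=0).fit(X, y_b).predict(X)

# LOF (default novelty=False): fit_predict also accepts y, only for API consistency
pred_lof_a = LocalOutlierFactor(n_neighbors=20).fit_predict(X, y_a)
pred_lof_b = LocalOutlierFactor(n_neighbors=20).fit_predict(X, y_b)

# If y is ignored as documented, flipping the labels changes nothing
same_iforest = np.array_equal(pred_if_a, pred_if_b)
same_lof = np.array_equal(pred_lof_a, pred_lof_b)
print(same_iforest, same_lof)
```

If y really is ignored, both flags should be True, which means the labels carry no information into either model.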
Since they are not classifiers (and do not involve the labels y in any way), it should be apparent that these models cannot be used for classification tasks, either by themselves or as base estimators of a stacked model, as you are trying to do here. Hence, the nan values of the recall metric are expected: since there are no labels y from the perspective of the models, there is no recall to compute in the first place.
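To see the underlying failure instead of nan placeholders, one option is to pass error_score='raise' to cross_val_score. Its default, error_score=np.nan, records a fold whose fit fails as nan rather than stopping. The sketch below uses a small synthetic data set; X and y are assumed stand-ins for the real data, and the exact exception type and message depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # synthetic features (assumption)
y = rng.integers(0, 2, size=200)    # synthetic binary labels (assumption)

# Same kind of stack as in the question: two outlier detectors as base estimators
ensemble = StackingClassifier(
    estimators=[
        ("lof", LocalOutlierFactor(n_neighbors=20)),
        ("iforest", IsolationForest(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
kf = KFold(n_splits=5, shuffle=True, random_state=10)

# error_score="raise" re-raises the fit error instead of scoring the fold as nan
try:
    cross_val_score(ensemble, X, y, cv=kf, scoring="recall", error_score="raise")
    failure = None
except Exception as exc:
    failure = exc

print(type(failure).__name__, failure)
```

The printed exception should point at the base estimators not being classifiers, which is the actual reason behind the nan scores.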
Upvotes: 1