henryjhu

Reputation: 9

Why did cross_val_score return all NaN?

There are several posts about similar questions, but none has the answer I want, so I am making this new post to look for it.

The question is: why do all the scores come back as NaN?
The data seems fine.
The setup seems fine.
What could be causing this?

Here is my data frame:

df.describe(include = "all")

            TRXN_MONTH      TRANSACTION_AMOUNT
count       598565.000000   5.985650e+05
mean        6.410199        2.457275e+07
std         3.446896        2.732986e+08
min         1.000000        2.000000e-02
25%         3.000000        1.823501e+04
50%         6.000000        1.649049e+05
75%         9.000000        1.318875e+06
max         12.000000       1.694837e+10

Variable data types:

df.dtypes

TRXN_MONTH            int64
TRANSACTION_AMOUNT    float64
dtype:                object

Here is my code:

# Ensemble by stacking
from sklearn import model_selection
from sklearn.ensemble import IsolationForest, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

estimator_list = [
    ('lof', LocalOutlierFactor(novelty=False, n_neighbors=20, contamination='auto')),
    ('iforest', IsolationForest(n_estimators=100, contamination='auto'))
]

ensemble = StackingClassifier(estimators=estimator_list, final_estimator=LogisticRegression(), cv=5)

# Set the number of folds and shuffle the rows before splitting
kf = model_selection.KFold(n_splits=10, random_state=10, shuffle=True)

# Evaluate the model using cross-validation
ensemble_cross_vald = model_selection.cross_val_score(
    ensemble,
    df_train[['TRXN_MONTH']].values,
    df_train[['TRANSACTION_AMOUNT']].values,
    cv=kf,
    n_jobs=-1,
    scoring='recall'
)

ensemble_cross_vald

Here is the output after running the code:

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

I am able to use each model to fit and predict:

from numpy import where

y_pred = lof.fit_predict(df)  # lof and iforest are standalone instances of the two models above
lofs_index = where(y_pred==-1)
lofs_index

(array([    17,     43,     61, ..., 598553, 598561, 598562]),)

y_pred = iforest.fit_predict(df) 
lofs_index = where(y_pred==-1)
lofs_index

(array([     6,     14,     15, ..., 598549, 598556, 598561]),)

Upvotes: 0

Views: 873

Answers (1)

desertnaut

Reputation: 60390

None of the base models you use for stacking here is a classifier: both are outlier/anomaly detection algorithms. They are not even supervised models, and they do not use the labels y at all, as you can see from their respective fit documentation:

For LOF:

fit(X, y=None)

y : Ignored

Not used, present for API consistency by convention.

For Isolation Forest:

fit(X, y=None, sample_weight=None)

y : Ignored

Not used, present for API consistency by convention.
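
You can check this behaviour directly: fitting with or without a y gives identical results, because y is simply discarded. Below is a minimal sketch on synthetic data (the shapes, values, and variable names are made up purely for illustration, not taken from your dataframe):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))   # toy stand-in for a single numeric feature
y = rng.randint(0, 2, size=1000)    # arbitrary "labels"

iforest = IsolationForest(n_estimators=100, contamination='auto', random_state=0)
# y is accepted for API consistency but ignored, so both calls agree
print(np.array_equal(iforest.fit_predict(X), iforest.fit_predict(X, y)))  # True

lof = LocalOutlierFactor(novelty=False, n_neighbors=20, contamination='auto')
print(np.array_equal(lof.fit_predict(X), lof.fit_predict(X, y)))          # True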

Since these models are not classifiers (and never involve the labels y in any way), they cannot be used for classification tasks, either on their own or as base estimators of a stacked model, as you are trying to do here. Hence the NaN values for the recall metric are expected: from the models' perspective there are no labels y, so there is no recall to compute in the first place.
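
As a practical debugging note: in recent scikit-learn versions cross_val_score does not raise when fitting a fold fails; it records error_score (np.nan by default) for that fold, which is exactly the all-NaN array you got. Passing error_score='raise' lets the underlying exception propagate so you can see why the fit failed. A sketch reusing the ensemble, kf and df_train names from the question (this only surfaces the error, it does not fix the model choice; n_jobs is left at its default so the traceback is easier to read):

from sklearn import model_selection

ensemble_cross_vald = model_selection.cross_val_score(
    ensemble,
    df_train[['TRXN_MONTH']].values,
    df_train[['TRANSACTION_AMOUNT']].values,
    cv=kf,
    scoring='recall',
    error_score='raise',  # propagate the real exception instead of returning NaN
)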

Upvotes: 1
