Reputation: 23
I'm attempting to generate some "which engine works best" data for a project I'm on. My general idea is to keep it simple: pick an engine, do a cross validation, collect the cross validation results, and call the engine with the largest mean score "best." All tests are done on the same set of training data.

Here's a snippet of my idea. I would then put this into a loop: instead of setting simple_clf to svm.SVC(), I'd loop over a list of engines and run the rest of the code for each one. The base data is in featurevecs, and scorenums contains the corresponding score value, 0 to 9, that each base data item is supposed to produce.
from sklearn import svm, grid_search, cross_validation
from sklearn.cross_validation import train_test_split

# featurevecs, scorenums and CLFPARAMS are defined elsewhere in my code
X_train, X_test, y_train, y_test = train_test_split(
    featurevecs, scorenums, test_size = 0.333, random_state = 0 )

# this would be in a loop of engine types but I'm just making sure basic code works
simple_clf = svm.SVC()
simple_clf = grid_search.GridSearchCV( simple_clf, CLFPARAMS, cv = 3 )
simple_clf.fit( X_train, y_train )

# cross validate on the held-out data (the KFold must match the data it indexes)
kf = cross_validation.KFold( len( X_test ), k = 5 )
scores = cross_validation.cross_val_score( simple_clf, X_test, y_test, cv = kf )
print scores.mean(), scores.std() / 2
# loop would end here
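To show what I mean by the loop, here's a rough sketch of what I'm planning (the engines and parameter grids below are just placeholders, not my real ones):

from sklearn import svm, neighbors, tree, grid_search, cross_validation

# placeholder engines and parameter grids, one entry per engine to compare
ENGINES = [
    ( 'SVC', svm.SVC(), { 'C': [ 1, 10 ], 'gamma': [ 0.01, 0.1 ] } ),
    ( 'kNN', neighbors.KNeighborsClassifier(), { 'n_neighbors': [ 3, 5, 7 ] } ),
    ( 'Tree', tree.DecisionTreeClassifier(), { 'max_depth': [ 3, 5, None ] } ),
]

results = {}
for name, engine, params in ENGINES:
    clf = grid_search.GridSearchCV( engine, params, cv = 3 )
    clf.fit( X_train, y_train )
    kf = cross_validation.KFold( len( X_test ), k = 5 )
    scores = cross_validation.cross_val_score( clf, X_test, y_test, cv = kf )
    results[ name ] = ( scores.mean(), scores.std() / 2 )
    print name, results[ name ]

# "best" would be the engine with the largest mean cross validation score
best_engine = max( results, key = lambda name: results[ name ][ 0 ] )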
My problem is that scores isn't usable for what I'm supposed to provide in terms of saying what's "best." scores gives me .mean() and .std() to print, but those are based only on exact matches. I don't want to count only the cases where the engine returns an exact match; I also want to count a "close" match. In my case, close means the numeric score is within 1 of the expected score. That is, if the expected score was a 3, then 2, 3 or 4 would all be considered a match and a good result.
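Just to pin down what I mean by close, this is the metric I have in mind (a standalone sketch, independent of any scikit-learn version):

import numpy as np

def within_one( y_true, y_pred ):
    # fraction of predictions no more than 1 away from the expected score
    return np.mean( np.abs( np.asarray( y_true ) - np.asarray( y_pred ) ) <= 1 )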
I looked through the documentation, and it seems the latest bleeding edge version of scikit-learn adds something to the metrics package that allows a custom score function to be passed to the grid search, but I'm unsure whether that would be enough for what I need, because I'd also have to pass it to the cross_val_score function, not just the grid search, no? Regardless, it isn't an option: I'm locked into which version of scikit-learn I can use.
I also noted a reference to cross_val_predict in the latest bleeding edge version, which seems to be just what I need, but again I'm locked into the version I use.
What did people do before the bleeding edge version, when the definition of "good" for cross validation wasn't the default exact match? Surely something was done. I just need to be pointed in the right direction.
I'm stuck at version 0.11 of scikit-learn because of corporate IT policy: I can only use approved software, and the version that was approved a while ago is the only option for me.
Here's what I changed things to, using the helpful hint to look at cross_val_score in the 0.11 docs: it can take a custom score function, and I can write my own as long as it matches the expected parameters. This is the code I have now. Would this do what I'm looking for, that is, generate results based not just on an exact match but also on a "close" match, where close is defined as within 1?
import numpy as np

# KLUDGE way of changing testing from match to close
SCORE_COUNT = 0
SCORE_CROSSOVER_COUNT = 0

def my_custom_score_function( y_true, y_pred ):
    # KLUDGE way of changing testing from match to close:
    # the first SCORE_CROSSOVER_COUNT calls count exact matches only,
    # later calls also count predictions that are off by 1
    global SCORE_COUNT, SCORE_CROSSOVER_COUNT
    if SCORE_COUNT < SCORE_CROSSOVER_COUNT:
        close_applies = False
    else:
        close_applies = True
    SCORE_COUNT += 1
    print( close_applies, SCORE_CROSSOVER_COUNT, SCORE_COUNT )
    deltas = np.abs( y_true - y_pred )
    good = 0
    for delta in deltas:
        if delta == 0:
            good += 1
        elif close_applies and delta == 1:
            good += 1
    answer = float( good ) / float( len( y_true ) )
    return answer
Code snippet from main routine:
fold_count = 5

# KLUDGE way of changing testing from match to close:
# set the global variables used by the custom scorer function
global SCORE_COUNT, SCORE_CROSSOVER_COUNT
SCORE_COUNT = 0
SCORE_CROSSOVER_COUNT = fold_count

# do a simple cross validation
simple_clf = svm.SVC()
simple_clf = grid_search.GridSearchCV( simple_clf, CLFPARAMS, cv = 3 )
simple_clf.fit( X_train, y_train )
# test_type is set elsewhere in my code
print( '{0} '.format( test_type ), end = "" )

kf = cross_validation.KFold( len( X_train ), k = fold_count )

# first pass: the score function counts exact matches only
scores = cross_validation.cross_val_score( simple_clf, X_train, y_train,
                                           cv = kf,
                                           score_func = my_custom_score_function )
print( 'Accuracy (+/- 0) {1:0.4f} (+/- {2:0.4f}) '.format( scores, scores.mean(),
                                                           scores.std() / 2 ),
       end = "" )

# second pass: the counter has crossed over, so "close" (within 1) also counts
scores = cross_validation.cross_val_score( simple_clf, X_train, y_train,
                                           cv = kf,
                                           score_func = my_custom_score_function )
print( 'Accuracy (+/- 1) {1:0.4f} (+/- {2:0.4f}) '.format( scores, scores.mean(),
                                                           scores.std() / 2 ),
       end = "" )
print( "" )
Upvotes: 0
Views: 504
Reputation: 28788
You can find the documentation for cross_val_score for 0.11 here.

You can provide a custom score function as the score_func argument; the interface was just different. That aside: why would you be "locked into" your current version? They are usually backward compatible for two releases.
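Roughly like this (a sketch of the 0.11-style call; clf, X and y are placeholders, and my_score_func is any function taking ( y_true, y_pred ) and returning a float where higher is better):

scores = cross_validation.cross_val_score( clf, X, y, cv = 5,
                                           score_func = my_score_func )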
Upvotes: 2