Che-Hao Kang

Reputation: 163

Sklearn - GridSearchCV with v_measure_score gives a different result than a manual for-loop

I am trying to use GridSearchCV with v_measure_score and compare the result
with another method WITHOUT GridSearchCV.

The best v_measure_score from the for-loop is 0.69816019299 at percentile 27;
the best score from GridSearchCV is 0.565562627046 at percentile 12.

I expected the two results to be the same.
I've checked my code several times but still cannot figure out the reason. The following is my code:

GridSearchCV

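# Imports used by both snippets (GridSearchCV is in sklearn.grid_search before scikit-learn 0.18)
from sklearn import cluster
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import make_scorer, v_measure_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

estimators = [('tfIdf', TfidfTransformer()), ('sPT', SelectPercentile()), ('kmeans', cluster.KMeans())]
pipe = Pipeline(estimators)
params = dict(tfIdf__smooth_idf=[True],
              sPT__score_func=[f_classif], sPT__percentile=range(100, 0, -1),
              kmeans__n_clusters=[clusterNum], kmeans__random_state=[0], kmeans__precompute_distances=[True])
v_measure_scorer = make_scorer(v_measure_score)
grid_search = GridSearchCV(pipe, param_grid=params, scoring=v_measure_scorer)
grid_search_fit = grid_search.fit(apiVectorArray, yTarget)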

v_measure_score by for-loop

bestPercent = [-1, -1]  # [best percentile, best v_measure_score]
# TF-IDF does not depend on the percentile, so it is computed once
apiVectorArrayTFIDF = TfidfTransformer(smooth_idf=True).fit_transform(apiVectorArray)
for percent in range(100, 0, -1):
    apiVectorFit = SelectPercentile(f_classif, percentile=percent).fit(apiVectorArrayTFIDF, yTarget)
    k_means = cluster.KMeans(n_clusters=clusterNum, random_state=0, precompute_distances=True).fit(apiVectorFit.transform(apiVectorArrayTFIDF))

    # Scored on the same data the model was fitted on
    score = v_measure_score(yTarget, k_means.labels_)
    if score > bestPercent[1]:
        bestPercent[0] = percent
        bestPercent[1] = score

I tried to add syntax highlighting to my code but failed.
Sorry that it is hard to read.

Thanks.

Upvotes: 0

Views: 198

Answers (1)

Che-Hao Kang

Reputation: 163

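I think the reason is that GridSearchCV uses cross-validation. For each parameter setting it fits the pipeline on the training folds and computes the score on the held-out fold, and best_score_ is the mean of those held-out scores. The for-loop fits on all the data and scores on that same data, so the two numbers are not directly comparable and the best percentile can differ.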
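
Here is a minimal sketch of that difference. It is not the code from the question: the data comes from `make_classification` as a stand-in for `apiVectorArray` / `yTarget`, the TF-IDF step is left out because the synthetic features are not counts, and `make_pipe` plus the three percentiles are made up for illustration. It compares the for-loop scores with the default cross-validated scores, and with a GridSearchCV run that gets a single "split" whose train and test indices are both the whole dataset, which should reproduce the for-loop numbers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import make_scorer, v_measure_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for apiVectorArray / yTarget (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
percentiles = [25, 50, 100]


def make_pipe(percent=10):
    # Hypothetical helper: feature selection followed by k-means.
    return Pipeline([
        ("sPT", SelectPercentile(f_classif, percentile=percent)),
        ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ])


def scores_by_percentile(search):
    # Map each tried percentile to its mean test score.
    res = search.cv_results_
    return {p["sPT__percentile"]: s
            for p, s in zip(res["params"], res["mean_test_score"])}


# For-loop style: fit on all the data, score on the same data.
loop_scores = {}
for p in percentiles:
    pipe = make_pipe(p).fit(X, y)
    loop_scores[p] = v_measure_score(y, pipe.predict(X))

params = {"sPT__percentile": percentiles}
scorer = make_scorer(v_measure_score)

# Default-style search: mean score over held-out folds.
cv_search = GridSearchCV(make_pipe(), params, scoring=scorer, cv=3).fit(X, y)
cv_scores = scores_by_percentile(cv_search)

# Single "split" where train and test are both the full dataset.
all_idx = np.arange(len(y))
full_search = GridSearchCV(make_pipe(), params, scoring=scorer,
                           cv=[(all_idx, all_idx)]).fit(X, y)
full_scores = scores_by_percentile(full_search)

for p in percentiles:
    print(p, loop_scores[p], cv_scores[p], full_scores[p])
```

The second and fourth printed columns should agree, while the third (held-out folds) will usually be different. Scoring on the training data is fine for reproducing the for-loop, but it says nothing about how the chosen percentile behaves on unseen data.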

Upvotes: 0
