Reputation: 163
I am trying to use GridSearchCV with v_measure_score
and to compare the result
with the same pipeline run WITHOUT GridSearchCV (a plain for-loop).
The best v_measure_score from the for-loop is 0.69816019299, at percentile 27;
the best score from GridSearchCV is 0.565562627046, at percentile 12.
In my opinion, the results should be the same.
I've checked my code several times but still cannot figure out the reason.
The following is my code:
GridSearchCV
from sklearn import cluster
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import make_scorer, v_measure_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# apiVectorArray, yTarget and clusterNum are defined earlier in my script
estimators = [('tfIdf', TfidfTransformer()), ('sPT', SelectPercentile()), ('kmeans', cluster.KMeans())]
pipe = Pipeline(estimators)
params = dict(tfIdf__smooth_idf=[True],
              sPT__score_func=[f_classif], sPT__percentile=range(100, 0, -1),
              kmeans__n_clusters=[clusterNum], kmeans__random_state=[0], kmeans__precompute_distances=[True])
v_measure_scorer = make_scorer(v_measure_score)
grid_search = GridSearchCV(pipe, param_grid=params, scoring=v_measure_scorer)
grid_search_fit = grid_search.fit(apiVectorArray, yTarget)
v_measure_score by for-loop
bestPercent = [-1, -1]
for percent in range(100, 0, -1):
    transformer = TfidfTransformer(smooth_idf=True)
    apiVectorArrayTFIDF = transformer.fit_transform(apiVectorArray)
    apiVectorFit = SelectPercentile(f_classif, percentile=percent).fit(apiVectorArrayTFIDF, yTarget)
    k_means = cluster.KMeans(n_clusters=clusterNum, random_state=0, precompute_distances=True).fit(apiVectorFit.transform(apiVectorArrayTFIDF))
    score = v_measure_score(yTarget, k_means.labels_)
    if score > bestPercent[1]:
        bestPercent[0] = percent
        bestPercent[1] = score
I tried to add syntax highlighting to my code but failed; sorry for your eyes.
Thanks.
Upvotes: 0
Views: 198
Reputation: 163
I think the reason is that GridSearchCV uses cross-validation: it fits on training folds, scores on the held-out folds, and averages those fold scores, so its best score is not comparable to a for-loop that fits and scores on the full data.
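To see the difference side by side, here is a minimal, self-contained sketch. It uses synthetic random data standing in for apiVectorArray/yTarget, a trimmed-down pipeline (no TF-IDF step), small hypothetical parameter values, and modern sklearn imports; all of those are assumptions, not my original setup. The point is that grid_search.best_score_ is the mean v-measure over held-out folds, while the for-loop's number comes from a model fit and scored on the full data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import make_scorer, v_measure_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-ins for apiVectorArray / yTarget (hypothetical sizes).
rng = np.random.RandomState(0)
X = rng.rand(60, 8)
y = rng.randint(0, 3, size=60)

pipe = Pipeline([('sPT', SelectPercentile(f_classif)),
                 ('kmeans', KMeans(n_clusters=3, random_state=0, n_init=10))])
params = {'sPT__percentile': [25, 50, 100]}

gs = GridSearchCV(pipe, params, scoring=make_scorer(v_measure_score), cv=3)
gs.fit(X, y)

# best_score_ is the mean v-measure over the held-out folds ...
cv_score = gs.best_score_

# ... whereas best_estimator_ (refit on all of X, since refit=True by
# default) scored on the full data corresponds to what the for-loop
# computes; the two numbers generally differ.
full_score = v_measure_score(y, gs.best_estimator_.predict(X))

print(cv_score, full_score)
```

So to make the two approaches agree, the for-loop would have to replicate GridSearchCV's fold-splitting and averaging (e.g. via cross_val_score) rather than scoring on the full data.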
Upvotes: 0