Reputation: 4162
I have a pipeline that works perfectly and now I want to check its top-k accuracy. I can obviously do this the hard way with a loop, but how can I do the same using the built-in function?
import numpy as np
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import train_test_split

# x and y can be any features and labels. Please assume x is already defined.
y = df_whole['target'].values.ravel()  # 1-D labels, currently strings
set_y = set(y)  # unique classes
class_int_mapping = dict(zip(set_y, range(len(set_y))))  # e.g. car: 0, bus: 1, ...
y = np.array([class_int_mapping[i] for i in y])  # array (a list also works)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, stratify=y)
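(Side note: the same string-to-integer mapping could presumably also be done with scikit-learn's LabelEncoder; a minimal sketch, equivalent to the dict above:)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(df_whole['target'].values.ravel())  # integer-encoded labels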
When I train and test my pipeline, it gives the desired results. Please assume any classification pipeline. When I do:
print(pipeline.predict_proba(x_train).shape, pipeline.predict_proba(x_test).shape)
>> (19794, 269) (6599, 269)
and when I do:
top_k_accuracy_score(y_test,pipeline.predict_proba(x_test), k = 5)
it gives me this error:
ValueError: Number of classes in 'y_true' (255) not equal to the number of classes in 'y_score' (269).
How can this be happening?
P.S.: For now, I am doing it like this:
probs = pipeline.predict_proba(x_test)
topn = np.argsort(probs, axis=1)[:, -5:]  # column indices of the 5 highest probabilities per row
top_k_acc_result = np.mean([y_test[k] in topn[k] for k in range(len(topn))])
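(A caveat on this manual version: it assumes column i of predict_proba corresponds to class label i. In general the column order is given by pipeline.classes_, so a safer sketch, assuming a fitted sklearn pipeline, would map the indices through it:)
probs = pipeline.predict_proba(x_test)
top5_idx = np.argsort(probs, axis=1)[:, -5:]   # column indices of the 5 largest probabilities
top5_labels = pipeline.classes_[top5_idx]      # map column indices to the actual class labels
manual_top5_acc = np.mean([y_test[k] in top5_labels[k] for k in range(len(top5_labels))])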
Upvotes: 0
Views: 1274
Reputation: 46978
Some of the classes the model was trained on do not appear in y_true, so the number of columns in the probability matrix does not tally with the number of distinct classes in y_true. You can pass the full label set with top_k_accuracy_score(..., labels=...).
For example:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=500, n_classes=6, n_informative=7, random_state=33)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, stratify=Y)

clf = RandomForestClassifier()
clf.fit(x_train, y_train)
It works well if we do:
top_k_accuracy_score(y_test,clf.predict_proba(x_test), k = 2)
If for some reason class 5 is missing from y_true, the same ValueError is raised:
ix = y_test != 5  # drop all rows whose true label is class 5
top_k_accuracy_score(y_test[ix], clf.predict_proba(x_test[ix, :]), k=2)
You can provide the full label set instead:
top_k_accuracy_score(y_test[ix], clf.predict_proba(x_test[ix, :]), k=2, labels=np.unique(Y))
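Applied to your pipeline, the equivalent call would presumably be the following (a sketch, assuming the final step of your fitted sklearn Pipeline is a classifier, so pipeline.classes_ holds the full set of integer labels it was trained on):
top_k_accuracy_score(y_test, pipeline.predict_proba(x_test), k=5, labels=pipeline.classes_)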
Upvotes: 1