Reputation: 1043
I have 50 products. For each product, I want to identify the four most related products using similarity measures.
I want to compare the ranked list generated by my model (predicted) with the ranked list specified by the domain experts (ground truth).
Through reading, I found that I may use rank-correlation-based approaches such as Kendall's tau or Spearman's rho to compare the ranked lists. However, I am not sure whether these approaches are suitable, as each list is very short (only 4 items). Please correct me if I am wrong.
Another approach is to use Jaccard similarity (the size of the set intersection divided by the size of the set union) to quantify the similarity between the two ranked lists. I could then plot a histogram of the values in setbased_list (see below).
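from sklearn.metrics import jaccard_similarity_score  # older scikit-learn releases only; later replaced by jaccard_score

setbased_list = []  # one similarity value per product
for index, row in evaluate.iterrows():
    d = row['Id']  # product identifier (not used below)
    y_pred = [3, 2, 1, 0]
    y_true = [row['A'], row['B'], row['C'], row['D']]
    sim = jaccard_similarity_score(y_true, y_pred)
    setbased_list.append(sim)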
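For comparison, here is a minimal sketch of the set-based Jaccard index written with plain Python sets. It assumes each list holds product identifiers. The function name `jaccard` and the identifiers `p1`, `p2`, ... are hypothetical and only for illustration.

```python
def jaccard(list_a, list_b):
    """Jaccard index of two lists treated as sets: |A & B| / |A | B|."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty lists count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical product identifiers, for illustration only.
expert = ["p7", "p3", "p9", "p1"]
model = ["p3", "p7", "p2", "p9"]

print(jaccard(expert, model))                   # 3 shared items out of 5 distinct -> 0.6
print(jaccard(expert, list(reversed(expert))))  # same items, reversed order -> 1.0
```

Because the lists are converted to sets, the second call returns 1.0 even though the order is reversed, so this measure on its own says nothing about rank positions.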
Is my approach to the problem above correct?
What are other approaches that I may use if I want to take into consideration the positions of elements in the list (weight-based)?
Upvotes: 3
Views: 1571
Reputation: 6284
From the way you have described the problem, it sounds as if you might as well just assign an arbitrary score for each item on your list - e.g. 3 points for the same item at the same rank as on the 'training' list, 1 point for the same item but at a different rank, or something like that.
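A minimal sketch of that scoring idea, using the example weights of 3 and 1 mentioned above. The function name `rank_score`, its parameter names, and the product identifiers are hypothetical, and the sketch assumes that neither list contains duplicates.

```python
def rank_score(expert, predicted, exact_points=3, present_points=1):
    """Score a predicted list against the expert list.

    exact_points:   awarded when an item sits at the same rank in both lists.
    present_points: awarded when an item is in the expert list at another rank.
    Items that do not appear in the expert list score nothing.
    """
    score = 0
    for position, item in enumerate(predicted):
        if position < len(expert) and expert[position] == item:
            score += exact_points
        elif item in expert:
            score += present_points
    return score


# Hypothetical product identifiers, for illustration only.
expert = ["p7", "p3", "p9", "p1"]
model = ["p7", "p9", "p2", "p3"]

# p7 is at the same rank (3); p9 and p3 are present at other ranks (1 each).
print(rank_score(expert, model))  # 5

# Dividing by the best possible score gives a value between 0 and 1.
max_score = 3 * len(expert)
print(rank_score(expert, model) / max_score)
```

The weights are arbitrary, as noted above, so they can be tuned to how much an exact rank match should matter relative to simply retrieving the right item.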
I'm not clear on the role of the 'not related' item though - are the other 45 items all equally 'not related' to the target item and if so does it matter which one you choose? Perhaps you need to take points away from the score if the 'not related' item appears in one of the 'related' positions? That subtlety might not be captured by a standard nonparametric correlation measure.
If it's important that you use a standard, statistically based measure for some reason then you are probably better off asking on Cross Validated.
Upvotes: 1