Reputation: 18810
I am trying to build a graph of precision and recall using click data. I have two data sources.
I used python and put these in two data sources into dictionaries as follows:
>>> print clicked_data
{101: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 103: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
>>> print all_relevant_data
{101: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 103: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]}
I was reading the article in scikit-learn web site (http://scikit-learn.org/stable/auto_examples/plot_precision_recall.html) and tried to follow the formula but got really confuse really setting up False Positive and False Negative.
following the equations in scikit-learn: According to above example prevision for item 101
P = T_positive/ (T_positive + F_positive)
>>> float(len(clicked_data[101]))/float(len(all_relevant_data[101]))
0.5555555555555556
But as I try to figure out Recall I am having trouble in getting False Negative items of click data. In theory False Negative means incorrectly marked. All I have is user clicked data for a given id and all the items that are relevant to that id.
R = T_positive / (T_positive + F_negative)
How can I calculate precision and recall correctly so I can build a graph.
On a different note if this is not a good metric to evaluate the results, considering the fact that I only have above mention data what would be the good metric?
Upvotes: 1
Views: 2706
Reputation: 37711
You can compute precision@k, recall@k based on your dataset. But you need a ranking of the documents to compute them.
Dataset
A well known dataset is AOL Search Query Logs which you can use to build a retrieval-based system (you just need a dataset and a retrieval function) to compute precision, recall, average precision and mean average precision. I am briefly explaining the terms mentioned above.
Document Ranking / Retrieval Function
Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). See the Wikipedia page for more details.
Precision and Recall
Precision measures "of all the documents we retrieved as relevant how many are actually relevant?".
Precision = No. of relevant documents retrieved / No. of total documents retrieved
Recall measures "Of all the actual relevant documents how many did we retrieve as relevant?".
Recall = No. of relevant documents retrieved / No. of total relevant documents
Suppose, when a query "q" is submitted to an information retrieval system (ex., search engine) having 100 relevant documents w.r.t. the query "q", the system retrieves 68 documents out of total collection of 600 documents. Out of 68 retrieved documents, 40 documents were relevant. So, in this case:
Precision = 40 / 68 = 58.8%
and Recall = 40 / 100 = 40%
F-Score / F-measure is the weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is:
F-Score = 2 * Precision * Recall / Precision + Recall
Average Precision
You can think of it this way: you type something in Google
and it shows you 10 results. It’s probably best if all of them were relevant. If only some are relevant, say five of them, then it’s much better if the relevant ones are shown first. It would be bad if first five were irrelevant and good ones only started from sixth, wouldn’t it? AP score reflects this.
Giving an example below:
AvgPrec of the two rankings:
Ranking#1: (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6) / 6 = 0.78
Ranking#2: (0.5 + 0.4 + 0.5 + 0.57 + 0.56 + 0.6) / 6 = 0.52
Mean Average Precision (MAP)
MAP is mean of average precision across multiple queries/rankings. Giving an example for illustration.
Mean average precision for the two queries:
For query 1, AvgPrec: (1.0+0.67+0.5+0.44+0.5) / 5 = 0.62
For query 2, AvgPrec: (0.5+0.4+0.43) / 3 = 0.44
So, MAP = (0.62 + 0.44) / 2 = 0.53
Sometimes, people use precision@k
, recall@k
as performance measure of a retrieval system. You should build a retrieval system for such testings. If you want to write your program in Java, you should consider Apache Lucene to build your index.
Upvotes: 3
Reputation: 63082
With just the clicked (TP) and relevant data (TP + FN) you do have the means to calculate the Recall - but not the Precision . You have no dataset that tells you the FP values.
Upvotes: 0