Reputation: 879
I'm trying to implement Explicit Semantic Analysis (ESA) via Lucene.
How do I take a term's TF-IDF within the query into consideration when matching documents?
For example:
The query should match Doc1 better than Doc2.
I'd like this to work without impacting performance.
Currently I'm doing this through query boosting, weighting each term relative to its TF-IDF.
Is there a better way?
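Roughly, the per-term boosting I mean looks like this sketch (the `tfidf_boosts` helper and the corpus statistics are made up for illustration; this is not Lucene API):

```python
import math
from collections import Counter

def tfidf_boosts(query_terms, doc_freq, num_docs):
    """Hypothetical helper: weight each query term by its frequency in
    the query (TF) times a smoothed inverse document frequency (IDF)."""
    tf = Counter(query_terms)
    return {
        term: freq * (1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1)))
        for term, freq in tf.items()
    }

# Build a boosted query string such as "a^3.35 b^3.81" for the query parser.
boosts = tfidf_boosts(["a", "a", "b"], doc_freq={"a": 50, "b": 5}, num_docs=100)
boosted_query = " ".join(f"{t}^{w:.2f}" for t, w in sorted(boosts.items()))
```

The boosted string is then handed to the query parser, which applies `term^boost` to each clause.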
Upvotes: 2
Views: 928
Reputation: 33341
Lucene already supports TF/IDF scoring by default, of course, so I'm not quite sure I understand what you are looking for.
It actually sounds a bit like you want to weigh query terms based on their TF/IDF within the query itself. So let's consider the two elements of that:
TF: Lucene sums the score of each query term. If the same query term appears twice in a query (like field:(a a b)), the doubled term would receive heavier weight, comparable to (though by no means identical to) boosting by 2.
IDF: IDF refers to data across a multi-document corpus. Since there is only one query, this doesn't apply. Or, if you want to get technical about it, all terms have an IDF of 1.
So IDF doesn't really make sense in that context, and TF is already done for you. You don't really need to do anything.
Keep in mind, though, that there are other scoring elements! The coord factor is significant here. Take the six-term query a b a c a d against two documents:

- a b a matches four of the query terms (a b a a, but not c d)
- a b c matches five of the query terms (a b a c a, but not d)

So that particular scoring element will score the second document more strongly.
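That coord arithmetic can be checked with a short sketch (assuming, as the coord(4/6) and coord(5/6) lines in the explain output imply, a six-term query a b a c a d in which a duplicated term counts once per occurrence):

```python
query = ["a", "b", "a", "c", "a", "d"]  # six query clauses, "a" repeated

def coord(doc_terms):
    # Fraction of query clauses that match the document; a duplicated
    # query term counts once per occurrence in the query.
    matched = sum(1 for t in query if t in doc_terms)
    return matched / len(query)

coord({"a", "b"})       # doc "a b a": 4 of 6 clauses match
coord({"a", "b", "c"})  # doc "a b c": 5 of 6 clauses match
```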
Here's the explain (see IndexSearcher.explain) output for doc a b a:
0.26880693 = (MATCH) product of:
  0.40321037 = (MATCH) sum of:
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
      0.07690979 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
  0.6666667 = coord(4/6)
And for doc a b c:
0.43768594 = (MATCH) product of:
  0.52522314 = (MATCH) sum of:
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
      0.217584 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
        0.435168 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.435168 = queryNorm
        0.5 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
  0.8333333 = coord(5/6)
Note that, as desired, the matches against term a receive higher weight in the first document, and you can also see each repeated a evaluated separately and added into the score.
Also note, however, the difference in coord, and in the idf of the term c in the second doc. Those score impacts simply wipe out the boost you get from adding multiples of the same term. If you add enough a's to the query, the documents will eventually swap places; the match on c is just evaluated to be a far more significant result.
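The DefaultSimilarity arithmetic in those two explain trees can be reproduced end to end. Here is a minimal sketch in plain Python (not the Lucene API) that recomputes both final scores from the same tf, idf, queryNorm, fieldNorm, and coord components; the 0.5 fieldNorm is taken directly from the explain output (it is Lucene's lossily encoded length norm):

```python
import math

def idf(doc_freq, max_docs):
    # DefaultSimilarity (classic TF/IDF): idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

query_terms = ["a", "b", "a", "c", "a", "d"]           # the six query clauses
docs = {0: ["a", "b", "a"], 1: ["a", "b", "c"]}
max_docs = len(docs)
doc_freq = {t: sum(t in d for d in docs.values()) for t in set(query_terms)}

# queryNorm = 1 / sqrt(sum of squared idf over all query clauses)
query_norm = 1.0 / math.sqrt(sum(idf(doc_freq[t], max_docs) ** 2
                                 for t in query_terms))

def score(doc_id):
    field_norm = 0.5          # fieldNorm as printed in the explain output
    total, matched = 0.0, 0
    for t in query_terms:     # each clause scored separately, then summed
        freq = docs[doc_id].count(t)
        if freq == 0:
            continue
        matched += 1
        query_weight = idf(doc_freq[t], max_docs) * query_norm
        field_weight = math.sqrt(freq) * idf(doc_freq[t], max_docs) * field_norm
        total += query_weight * field_weight
    return total * (matched / len(query_terms))   # apply coord

score(0)   # doc "a b a" -> ~0.26880693
score(1)   # doc "a b c" -> ~0.43768594
```

Running this reproduces both top-line numbers to within single-precision rounding, which makes it easy to see how adding more a clauses to the query shifts the balance between the two documents.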
Upvotes: 2