Reputation: 369
I am working with Lucene and have a question about some queries that are giving me different results.
The three queries are:
Q1 = "Java 8 is verified to be compatible"
Q2 = "Java 8 not verified as a compatible"
Q3 = "Java 8 not verified as compatible"
I am trying to understand why the results of Q1 and Q2 are so similar, but different from those of Q3.
Can anyone also tell me what type of information retrieval issue this is? I know it has something to do with natural language, IR indexing, and query language.
Thank you.
Upvotes: 1
Views: 41
Reputation: 33341
I'm guessing you are using StandardAnalyzer with a default (English) stop set, or EnglishAnalyzer.
Either way, stop words are the thing to focus on here. "is", "not", "to", "be", "as", and "a" are all stop words, and so are effectively eliminated from the index.
Though they are no longer searchable, eliminated stop words do leave a position increment behind, so the post-analysis representation will really be something like:
Q1 = "Java 8 _ verified _ _ compatible"
Q2 = "Java 8 _ verified _ _ compatible"
Q3 = "Java 8 _ verified _ compatible"
Here, each _ represents an eliminated, unsearchable stop word. Based on that representation, Q1 and Q2 are actually identical queries after analysis. Q3, though, has one fewer position increment, meaning that it expects one fewer word (or token) to appear between "verified" and "compatible".
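To make the position bookkeeping concrete, here is a minimal sketch in plain Python. It is a simulation, not the Lucene API: the `analyze` and `phrase_matches` helpers are hypothetical, the stop set contains only the six words named above rather than any analyzer's full default list, and it assumes the queries are run as exact phrase queries (no slop).

```python
# Assumed stop set: only the words named in this answer,
# not the full default list of any real analyzer.
STOP_WORDS = {"is", "not", "to", "be", "as", "a"}


def analyze(text):
    """Return (term, position) pairs.

    Stop words are dropped, but the position counter still advances
    for each one, which leaves a gap behind (the "_" above).
    """
    tokens = []
    for position, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            tokens.append((word, position))
    return tokens


def phrase_matches(query_tokens, doc_tokens):
    """Exact phrase match: every query term must appear in the document
    at the same relative position, gaps included."""
    doc = set(doc_tokens)
    first_term, first_pos = query_tokens[0]
    for term, pos in doc_tokens:
        if term != first_term:
            continue
        shift = pos - first_pos
        if all((t, p + shift) in doc for t, p in query_tokens):
            return True
    return False


q1 = analyze("Java 8 is verified to be compatible")
q2 = analyze("Java 8 not verified as a compatible")
q3 = analyze("Java 8 not verified as compatible")

print(q1)        # [('java', 0), ('8', 1), ('verified', 3), ('compatible', 6)]
print(q1 == q2)  # True
print(q3)        # [('java', 0), ('8', 1), ('verified', 3), ('compatible', 5)]

# A document containing the Q1 text matches Q1 and Q2, but not Q3.
doc = analyze("Java 8 is verified to be compatible")
print(phrase_matches(q2, doc))  # True
print(phrase_matches(q3, doc))  # False
```

In this sketch Q1 and Q2 produce the same terms at the same positions, so they behave as one query, while Q3 places "compatible" one position earlier and so fails to line up with the same document.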
Upvotes: 2