Reputation: 91
I have an index generated by the PDFBox API class LucenePDFDocument. As the index contains only the text contents, I wish to search this index effectively.
I will search the 'contents' field with the search string, and the results must be ordered from most relevant to least relevant. The code given below displayed files that contain the individual words of the search text, e.g. 'What is your nationality', but the results did not include a file containing the full sentence.
Which query parser and query should I use to search in this scenario?
Query query = new MultiFieldQueryParser(Version.LUCENE_30, fields,
        new StandardAnalyzer(Version.LUCENE_30))
        .parse(searchString);
TopScoreDocCollector collector = TopScoreDocCollector.create(5, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("count " + hits.length);
for (ScoreDoc scoreDoc : hits) {
    int docId = scoreDoc.doc;
    Document d = searcher.doc(docId);
    System.out.println(d.getField("path"));
}
Upvotes: 3
Views: 873
Reputation: 28492
This is not about the programmatic part but about Lucene query syntax. To search for a whole phrase, just wrap it in double quotes, i.e. instead of searching
What is your nationality
search
"What is your nationality"
Without quotes, Lucene finds all documents containing any of the separate words, i.e. "what", "is", "your" and "nationality" ("is" and "your" may be omitted as stop words), and ranks them by a score driven by how often those words occur anywhere in the document, not only within that phrase. Since you limit the number of results to 5 in TopScoreDocCollector, the file with the phrase may not appear in the results. Adding quotes makes Lucene ignore all documents that do not contain the exact phrase.
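If the search text comes from user input, the quoting can be done in code before parsing. The following is only a minimal sketch: the class and method names (PhraseQueryExample, toPhraseQuery) are hypothetical, and it assumes the backslash-escaping rule of the classic Lucene query syntax applies to backslashes and double quotes inside the phrase.

```java
// Hypothetical helper, not part of Lucene: builds a phrase query string
// from raw user text so that the parser treats it as one quoted phrase.
final class PhraseQueryExample {

    // Escapes backslashes and double quotes (assumed to follow the classic
    // Lucene query syntax), then wraps the text in double quotes.
    static String toPhraseQuery(String text) {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // The resulting string is what would be passed to the parser's
        // parse(...) call in place of the raw searchString.
        System.out.println(toPhraseQuery("What is your nationality"));
    }
}
```

The returned string would then replace searchString in the parse call from the question.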
Also, if you search only the 'contents' field, you do not need MultiFieldQueryParser and can use a plain QueryParser instead.
Upvotes: 1