Vijay Veeraraghavan

Reputation: 91

Lucene - Effective text search

I have an index generated by the PDFBox API class LucenePDFDocument. Since the index contains only the text contents, I want to search it effectively.

I search the 'contents' field with the search string, and the results must be ordered from most relevant to least relevant. The code below displays files that contain the individual words of the search text, e.g. 'What is your nationality', but the results did not include a file that contains this full sentence.

Which query parser and query should I use for this scenario?

    Query query = new MultiFieldQueryParser(Version.LUCENE_30, fields,
            new StandardAnalyzer(Version.LUCENE_30))
            .parse(searchString);

    TopScoreDocCollector collector = TopScoreDocCollector.create(5,
            false);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    System.out.println("count " + hits.length);
    for (ScoreDoc scoreDoc : hits) {
        int docId = scoreDoc.doc;
        Document d = searcher.doc(docId);
        System.out.println(d.getField("path"));
    }

Upvotes: 3

Views: 873

Answers (1)

ffriend

Reputation: 28492

This is not about the programmatic part but about Lucene query syntax. To search for a whole phrase, wrap it in double quotes, i.e. instead of searching for

What is your nationality

search

"What is your nationality"

Without quotes, Lucene finds all documents containing any of the separate words, i.e. "what", "is", "your" and "nationality" ("is" and "your" may be omitted as stop words), and ranks them by the overall occurrences of those words in the document, not only within that phrase. Since you limit the number of docs to find to 5 in TopScoreDocCollector, the file with the phrase may not appear in the results. Adding quotes makes Lucene ignore all docs that do not contain the exact phrase.
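If the search string comes from user input, the quoting can be done in code before the string is handed to the parser. The following is only a minimal sketch in plain Java: the class PhraseQueries and the method toPhrase are hypothetical names, not part of Lucene, and the sketch assumes the parser accepts backslash-escaped quotes inside a quoted phrase.

```java
// Hypothetical helper (not a Lucene class): turns raw user input into a
// quoted phrase string for the query parser.
final class PhraseQueries {
    private PhraseQueries() {}

    /** Wraps the input in double quotes so it is parsed as a single phrase. */
    static String toPhrase(String userInput) {
        String trimmed = userInput.trim();
        // Escape backslashes first, then embedded quotes, so that the input
        // cannot terminate the phrase early. Assumption: the query parser
        // honours backslash escapes inside a quoted phrase.
        String escaped = trimmed.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // Prints the phrase form of the example search string.
        System.out.println(toPhrase("What is your nationality"));
    }
}
```

The result of toPhrase(searchString) would then be passed to parse(...) in place of the raw searchString.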

Also, if you search only the 'contents' field, you do not need MultiFieldQueryParser and can use the simple QueryParser instead.

Upvotes: 1
