Tony

Reputation: 12695

Lucene.net - can't do a multiword search

I have stored the following documents in my Lucene index:

{
"id" : 1,
"name": "John Smith",
"description": "worker",
"additionalData": "faster data",
"attributes": "is_hired=not"
},
{
"id" : 2,
"name": "Alan Smith",
"description": "hired",
"additionalData": "faster drive",
"attributes": "is_hired=not"
},
{
"id" : 3,
"name": "Mike Std",
"description": "hired",
"additionalData": "faster check",
"attributes": "is_hired=not"
}

and now I want to search over all the fields to check whether the given value exists:

search term: "John data check"

which should return the documents with IDs 1 and 3. But it doesn't. Why?

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

BooleanQuery mainQuery = new BooleanQuery();
mainQuery.MinimumNumberShouldMatch = 1;

var cols = new string[] { "name", "additionalData" };

string[] words = searchData.text.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);

var queryParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, cols, analyzer);

foreach (var word in words)
{
    BooleanQuery innerQuery = new BooleanQuery();
    innerQuery.MinimumNumberShouldMatch = 1;

    innerQuery.Add(queryParser.Parse(word), Occur.SHOULD);

    mainQuery.Add(innerQuery, Occur.MUST);
}

TopDocs hits = searcher.Search(mainQuery, null, int.MaxValue, Sort.RELEVANCE);

// hits.TotalHits is 0 !!

Upvotes: 2

Views: 328

Answers (2)

Tony

Reputation: 12695

Well, in my case I had stored a string array under a single field name, so I had to retrieve all the field values from the result Document, because Document.Get("field_name") returns only the first value when several fields share the same name:

var multi_fields = doc.GetFields("field_name");
var field_values = multi_fields.Select(x => x.StringValue).ToArray();

Plus, I had to enable wildcard search, because the search fails if I don't type a full word, e.g. Jo instead of John:

string[] words = "Jo data check"
    .Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)
    .Select(x => string.Format("*{0}*", x))
    .ToArray();

var queryParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, cols, analyzer);
queryParser.AllowLeadingWildcard = true; // required for terms that start with '*'

Upvotes: 0

Lucas Trzesniewski

Reputation: 51330

The query you constructed basically requires all three words to match.

You wrap each word in a BooleanQuery with a SHOULD clause. This is equivalent to using the inner query directly (you're just adding an indirection which does not change the behavior of the query). The boolean query has only one clause, which has to match for the boolean query to match.

Then, you wrap each one of these in another boolean query, this time with a MUST clause for each. This means each clause must match for the query to match.

For a BooleanQuery to match, all MUST clauses have to be satisfied, and at least MinimumNumberShouldMatch SHOULD clauses have to be satisfied as well. If there are no MUST clauses, at least one SHOULD clause has to match. Leave that property at its default value, as the documented behavior is:

By default no optional clauses are necessary for a match (unless there are no required clauses).

Effectively, your query is (ignoring the per-field expansion done by MultiFieldQueryParser, for simplicity):

+(john) +(data) +(check)

Or, in a tree form:

BooleanQuery
    MUST: BooleanQuery
        SHOULD: TermQuery: john
    MUST: BooleanQuery
        SHOULD: TermQuery: data
    MUST: BooleanQuery
        SHOULD: TermQuery: check

Which can be simplified to:

BooleanQuery
    MUST: TermQuery: john
    MUST: TermQuery: data
    MUST: TermQuery: check

But the query you want is:

BooleanQuery
    SHOULD: TermQuery: john
    SHOULD: TermQuery: data
    SHOULD: TermQuery: check

So, remove the mainQuery.MinimumNumberShouldMatch = 1; line, then replace your foreach body with the following and it should get the job done:

mainQuery.Add(queryParser.Parse(word), Occur.SHOULD);
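To check that reasoning against the sample documents, here is a minimal sketch in plain C# that only models the matching rule. It does not use Lucene.Net at all: the BooleanMatchSketch class, its Docs table and the MatchingIds method are hypothetical names made up for illustration, and the analyzer is approximated by lowercasing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the index: document id -> lowercased terms taken
// from the "name" and "additionalData" fields. No Lucene.Net types are used.
public static class BooleanMatchSketch
{
    private static readonly Dictionary<int, string[]> Docs = new Dictionary<int, string[]>
    {
        { 1, new[] { "john", "smith", "faster", "data" } },
        { 2, new[] { "alan", "smith", "faster", "drive" } },
        { 3, new[] { "mike", "std", "faster", "check" } },
    };

    // requireAll = true  models one MUST clause per word (every word has to match).
    // requireAll = false models one SHOULD clause per word with no MUST clauses
    // (at least one word has to match).
    public static int[] MatchingIds(string[] words, bool requireAll)
    {
        var terms = words.Select(w => w.ToLowerInvariant()).ToArray();
        return Docs
            .Where(d => requireAll
                ? terms.All(t => d.Value.Contains(t))
                : terms.Any(t => d.Value.Contains(t)))
            .Select(d => d.Key)
            .OrderBy(id => id)
            .ToArray();
    }

    public static void Main()
    {
        var words = new[] { "John", "data", "check" };
        Console.WriteLine("MUST:   [{0}]", string.Join(", ", MatchingIds(words, true)));
        Console.WriteLine("SHOULD: [{0}]", string.Join(", ", MatchingIds(words, false)));
    }
}
```

With MUST clauses no document contains all three words, so nothing matches. With SHOULD clauses, documents 1 and 3 match, and document 2 is left out because it contains none of the words.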

Ok, so here's a full example, which works for me:

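// Namespaces needed by this example (Lucene.Net 3.0.x)
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);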

var directory = new RAMDirectory();

using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "John Smith", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster data", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    doc = new Document();
    doc.Add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "Alan Smith", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster drive", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    doc = new Document();
    doc.Add(new Field("id", "3", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "Mike Std", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster check", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}

var words = new[] {"John", "data", "check"};
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, new[] {"name", "additionalData"}, analyzer);


var mainQuery = new BooleanQuery();
foreach (var word in words)
    mainQuery.Add(parser.Parse(word), Occur.SHOULD); // Should probably use parser.Parse(QueryParser.Escape(word)) instead

using (var searcher = new IndexSearcher(directory))
{
    var results = searcher.Search(mainQuery, null, int.MaxValue, Sort.RELEVANCE);
    var idFieldSelector = new MapFieldSelector("id");

    foreach (var scoreDoc in results.ScoreDocs)
    {
        var doc = searcher.Doc(scoreDoc.Doc, idFieldSelector);
        Console.WriteLine("Found: {0}", doc.Get("id"));
    }
}

Upvotes: 3
