swaechter

Reputation: 1439

How to use QueryParser for Lucene range queries (IntPoint/LongPoint)

One thing I really like about Lucene is the query language, which lets me (or an application user) write dynamic queries. I parse these queries via

QueryParser parser = new QueryParser("", indexWriter.getAnalyzer());
Query query = parser.parse("id:1 OR id:3");

But this does not work for queries against the numeric field, like these:

Query query = parser.parse("value:[100 TO 202]"); // Returns nothing
Query query = parser.parse("id:1 OR value:167"); // Returns only the document with ID 1, not the one with ID 2

On the other hand, it works via the API (but then I give up the convenience of just taking the query string as input):

Query query = LongPoint.newRangeQuery("value", 100L, 202L); // Returns 1, 2 and 3

Is this a bug in the query parser, or am I missing an important point, such as QueryParser comparing lexical rather than numerical values? How can I change this while still parsing the query string instead of using the query API?

This question is a follow-up to another question that pointed out the problem, but not the reason: Lucene LongPoint Range search doesn't work

Full code:

package acme.prod;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.util.Arrays;
import java.util.List;

public class LuceneRangeExample {

    public static void main(String[] arguments) throws Exception {
        // Create the index
        Directory searchDirectoryIndex = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(searchDirectoryIndex, new IndexWriterConfig(new StandardAnalyzer()));

        // Add several documents that have an ID and a value
        List<Long> values = Arrays.asList(23L, 145L, 167L, 201L, 20100L);
        int counter = 0;
        for (Long value : values) {
            Document document = new Document();
            document.add(new StringField("id", Integer.toString(counter), Field.Store.YES));
            document.add(new LongPoint("value", value));
            document.add(new StoredField("value", Long.toString(value)));
            indexWriter.addDocument(document);
            indexWriter.commit();
            counter++;
        }

        // Create the reader and search for the range 100 to 202
        IndexReader indexReader = DirectoryReader.open(indexWriter);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        QueryParser parser = new QueryParser("", indexWriter.getAnalyzer());
//        Query query = parser.parse("id:1 OR value:167");
//        Query query = parser.parse("value:[100 TO 202]");
        Query query = LongPoint.newRangeQuery("value", 100L, 202L);
        TopDocs hits = indexSearcher.search(query, 100);
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int docid = hits.scoreDocs[i].doc;
            Document document = indexSearcher.doc(docid);
            System.out.println("ID: " + document.get("id") + " with range value " + document.get("value"));
        }
    }
}

Upvotes: 2

Views: 3009

Answers (1)

andrewJames

Reputation: 21902

I think there are a few different things to note here:

1. Using the classic parser

As you show in your question, the classic parser supports range searches, as documented here. But the key point to note in the documentation is:

Sorting is done lexicographically.

That is to say, it uses text-based sorting to determine whether a field's values are within the range or not.
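
To make the difference concrete, here is a minimal plain-Java sketch (no Lucene involved) that applies a text-based and a number-based range check to the values from the question. The class and method names are hypothetical and exist only for this illustration; it shows how the two orderings differ, not how Lucene implements its range queries internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LexicalVsNumericRange {

    // Keeps values whose decimal string form lies in [lower, upper] under String ordering.
    static List<Long> lexicalRange(List<Long> values, String lower, String upper) {
        List<Long> matches = new ArrayList<>();
        for (Long value : values) {
            String text = Long.toString(value);
            if (text.compareTo(lower) >= 0 && text.compareTo(upper) <= 0) {
                matches.add(value);
            }
        }
        return matches;
    }

    // Keeps values that lie in [lower, upper] under numeric ordering.
    static List<Long> numericRange(List<Long> values, long lower, long upper) {
        List<Long> matches = new ArrayList<>();
        for (Long value : values) {
            if (value >= lower && value <= upper) {
                matches.add(value);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(23L, 145L, 167L, 201L, 20100L);
        System.out.println("Lexical: " + lexicalRange(values, "100", "202")); // [145, 167, 201, 20100]
        System.out.println("Numeric: " + numericRange(values, 100L, 202L));   // [145, 167, 201]
    }
}
```

As text, "20100" sorts between "100" and "202", so a purely lexical range check would accept it even though the number 20100 is far outside the range.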

However, your field is a LongPoint field (again, as you show in your code). This field stores your data as an array of longs, as shown in the constructor.

This is not lexicographical data - and even when you only have one value, it's not handled as string data.

I assume that this is why the following queries do not work as expected - but I am not 100% sure of this, because I did not find any documentation confirming this:

Query query = parser.parse("id:1 OR value:167");
Query query = parser.parse("value:[100 TO 202]");

(I am slightly surprised that these queries do not throw errors).

2. Using a LongPoint Query

As you have also shown, you can use one of the specialized LongPoint queries to get the results you expect - in your case, you used LongPoint.newRangeQuery("value", 100L, 202L).

But as you also note, you lose the benefits of the classic parser syntax.

3. Using the Standard Query Parser

This may be a good approach which allows you to continue using your preferred syntax, while also supporting number-based range searches.

The StandardQueryParser is a newer alternative to the classic parser, but it uses the same syntax as the classic parser by default.

This parser lets you configure a "points config map", which tells the parser which fields to handle as numeric data, for operations such as range searches.

For example:

import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.queryparser.flexible.standard.config.PointsConfig;
import java.text.DecimalFormat;
import java.util.Map;
import java.util.HashMap;

...

StandardQueryParser parser = new StandardQueryParser();
parser.setAnalyzer(indexWriter.getAnalyzer());

// Here I am just using the default decimal format - but you can provide
// a specific format string, as needed:
PointsConfig pointsConfig = new PointsConfig(new DecimalFormat(), Long.class);
Map<String, PointsConfig> pointsConfigMap = new HashMap<>();
pointsConfigMap.put("value", pointsConfig);
parser.setPointsConfigMap(pointsConfigMap);

Query query1 = parser.parse("value:[101 TO 203]", "");

Running your index searcher code with the above query gives the following output:

ID: 1 with range value 145
ID: 2 with range value 167
ID: 3 with range value 201

Note that this correctly excludes the 20100L value (which would be included if the query was using lexical sorting).

I don't know of any way to get the same results using only the classic query parser - but at least this is using the same query syntax that you would prefer to use.
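
On the format passed to PointsConfig: as far as I understand it, the parser uses that NumberFormat to turn the text in the query into a number, so the pattern decides which spellings of a number are accepted. The sketch below uses only java.text to show the parsing step on its own. The class and helper names are hypothetical, and pinning the symbols to Locale.US is my assumption (a no-argument new DecimalFormat() follows the JVM's default locale instead).

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class NumberFormatParsing {

    // Parses text into a long using an explicit pattern and fixed (US) symbols.
    static long parseLong(String pattern, String text) {
        DecimalFormat format = new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
        try {
            return format.parse(text).longValue();
        } catch (ParseException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLong("#", "167"));        // plain digits
        System.out.println(parseLong("#,##0", "20,100")); // digits with a grouping separator
    }
}
```

A format built this way could be passed as the first argument to PointsConfig in place of the default DecimalFormat if you want parsing that does not depend on the machine's locale.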

Upvotes: 5
