PhraseQuery is not working in Apache lucene 7.2.1

Question

I am new to Apache Lucene and am using v7.2.1. I need to run a phrase search over a huge file, so I first wrote a small sample to see how phrase search works with PhraseQuery. It does not work. My code is given below:

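import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample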
{

  private static final String INDEX_DIR = "myIndexDir";
  // function to create index writer
  private static IndexWriter createWriter() throws IOException
  {
    FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    IndexWriter writer = new IndexWriter(dir, config);
    return writer;
  }
// function to create the index document.
  private static Document createDocument(Integer id, String source, String target)
  {
    Document document = new Document();
    document.add(new StringField("id", id.toString() , Store.YES));
    document.add(new TextField("source", source , Store.YES));
    document.add(new TextField("target", target , Store.YES));
    return document;
  }

  // function to do index search by source
  private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
  {        
      // phrase query build
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    String[] words = source.split(" ");
    int ii = 0;
    for (String word : words) {
        builder.add(new Term("source", word), ii);
        ii = ii + 1;
    }
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
  }

  public static void main(String[] args) throws Exception 
  {
    // TODO Auto-generated method stub
    // create index writer
    IndexWriter writer = createWriter();
    //create documents object
    List<Document> documents = new ArrayList<>();

    String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
    String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
    Document d1 = createDocument(1, src, tgt);
    documents.add(d1);

    src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
    tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
    Document d2 = createDocument(2, src, tgt);
    documents.add(d2);

    writer.deleteAll();

    // adding documents to index writer
    writer.addDocuments(documents);
    writer.commit();
    writer.close();

    // for index searching

    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

    //Search by source
    TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
    System.out.println("Total Results count :: " + foundDocs.totalHits);
  }

}

When I search for the string "benefit of an individual" as shown above, the total results count comes back as 0, although the phrase is present in document 1. Any help in resolving this issue would be appreciated. Thanks in advance.

Ivan Mamontov · Accepted Answer

Let's start with a summary:

at index time you use StandardAnalyzer, which removes English stop words
at query time you use your own analysis (splitting on spaces), which neither removes stop words nor strips special characters

The rule is: use the same analysis chain at index time and at query time.

Here is a simplified example of "correct" query processing:

  // function to do index search by source
  // additional imports: org.apache.lucene.analysis.TokenStream,
  // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
  private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
    // build the phrase query with the same analyzer that was used at index time
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    try (StandardAnalyzer analyzer = new StandardAnalyzer();
         TokenStream tokenStream = analyzer.tokenStream("source", source)) {
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
        // stop words never reach this loop, so the remaining terms get consecutive positions
        builder.add(new Term("source", charTermAttribute.toString()));
      }
      tokenStream.end();
    }
    // the slop bridges the gap left by the removed stop words
    builder.setSlop(2);
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
  }

For the sake of simplicity, you can remove stop words from the Standard analyzer by using the constructor that takes an empty stop word set, e.g. new StandardAnalyzer(CharArraySet.EMPTY_SET), at both index and query time. Then everything works as you expected. You can read more about stop words and phrase queries here.

All the problems with phrase queries start with stop words. Under the hood, Lucene accounts for the position of every word, stop words included, when it stores term positions in a special index structure. This is useful in some cases, for example to tell "the goal" apart from "goal". A phrase query takes these term positions into account. For example, take the text "black and white" with the stop word "and". The Lucene index will contain two terms: "black" at position 1 and "white" at position 3. A naive phrase query "black white" does not match, because it allows no gap between term positions. There are two possible strategies for building the right query:

"black ? white" - uses a special marker for every stop word. This matches "black and white" and "black or white".
"black white"~1 - allows a gap of one position between the terms. This also matches "black or white". With a larger slop the reversed "white and black" matches too (swapping two adjacent terms alone costs a slop of 2, and the gap adds 1).

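The arithmetic behind both strategies can be sketched without Lucene. This is a simplified model, not Lucene's actual scorer: it assumes each query term occurs exactly once in the text, and the class and method names (SlopSketch, requiredSlop) are made up for illustration. For one alignment, the slop needed is the spread of "document position minus query position" across the terms.

```java
// Lucene-free sketch: how much slop one alignment of a phrase needs.
class SlopSketch {

    // docPositions[i] = position of the i-th query term in the indexed text
    // queryOffsets[i] = position given to that term inside the phrase query
    // Assumption: every term occurs once; a real scorer checks all occurrences.
    static int requiredSlop(int[] docPositions, int[] queryOffsets) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            int shifted = docPositions[i] - queryOffsets[i];
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }

    public static void main(String[] args) {
        // Indexed text "black and white": black at 1, white at 3 ("and" dropped).
        int[] doc = {1, 3};
        System.out.println(requiredSlop(doc, new int[] {0, 1})); // "black white"   -> 1
        System.out.println(requiredSlop(doc, new int[] {0, 2})); // "black ? white" -> 0
        // Indexed text "white and black": black at 3, white at 1.
        System.out.println(requiredSlop(new int[] {3, 1}, new int[] {0, 1})); // -> 3
    }
}
```

A query matches when its slop is at least the value returned, which is why "black white" needs ~1 here while "black ? white" matches exactly.
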
To build the right query, you can read the following token attribute during query processing:

PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
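
Each token's position increment says how far it sits from the previous token, so a removed stop word shows up as an increment greater than 1. A minimal sketch of turning increments into absolute positions follows; it is plain Java with no Lucene dependency, and PositionSketch and toPositions are hypothetical names. The increments in main are what I would expect for "benefit of an individual" once "of" and "an" are dropped, which is an assumption rather than captured analyzer output. In the real method, each resulting position would be passed as the second argument of builder.add(term, position).

```java
import java.util.Arrays;

// Lucene-free sketch: position increments -> absolute phrase positions.
class PositionSketch {

    // Hypothetical helper: sums the increments, starting one step before position 0,
    // so that a first token with increment 1 lands on position 0.
    static int[] toPositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Assumed increments: "benefit" +1, "individual" +3 (skipping "of" and "an").
        int[] increments = {1, 3};
        System.out.println(Arrays.toString(toPositions(increments))); // prints [0, 2]
    }
}
```

With positions 0 and 2 in the query, the gap in the query mirrors the gap in the index and no slop is needed.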

I used setSlop(2) to keep the code snippet simple; you can instead set the slop based on the query length, or pass the correct term positions to the phrase builder. My recommendation is not to use stop words at all; you can read about stop words here.
