Reputation: 148
I am new to Apache Lucene and am using v7.2.1. I need to run phrase searches over a huge file, so I first wrote a small sample to try out phrase search with PhraseQuery, but it does not work. My code is given below:
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample
{
    private static final String INDEX_DIR = "myIndexDir";

    // function to create index writer
    private static IndexWriter createWriter() throws IOException
    {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter writer = new IndexWriter(dir, config);
        return writer;
    }

    // function to create the index document.
    private static Document createDocument(Integer id, String source, String target)
    {
        Document document = new Document();
        document.add(new StringField("id", id.toString(), Store.YES));
        document.add(new TextField("source", source, Store.YES));
        document.add(new TextField("target", target, Store.YES));
        return document;
    }

    // function to do index search by source
    private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
    {
        // phrase query build
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        String[] words = source.split(" ");
        int ii = 0;
        for (String word : words) {
            builder.add(new Term("source", word), ii);
            ii = ii + 1;
        }
        PhraseQuery query = builder.build();
        System.out.println(query);
        // phrase search
        TopDocs hits = searcher.search(query, 10);
        return hits;
    }

    public static void main(String[] args) throws Exception
    {
        // create index writer
        IndexWriter writer = createWriter();
        // create documents object
        List<Document> documents = new ArrayList<>();
        String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
        String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
        Document d1 = createDocument(1, src, tgt);
        documents.add(d1);
        src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
        tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
        Document d2 = createDocument(2, src, tgt);
        documents.add(d2);
        writer.deleteAll();
        // adding documents to index writer
        writer.addDocuments(documents);
        writer.commit();
        writer.close();
        // for index searching
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        // search by source
        TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
        System.out.println("Total Results count :: " + foundDocs.totalHits);
    }
}
When I search for the string "benefit of an individual" as shown above, the total results count comes back as 0, even though the phrase is present in document 1. It would be great if I could get any help in resolving this issue. Thanks in advance.
Reputation: 2924
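Let's start with the summary:
The rule is: use the same analysis chain at index time and at query time. Your code splits the query on spaces and searches for the raw words, while StandardAnalyzer lowercased the indexed text and removed stop words such as "of" and "an".
Here is a simplified but "correct" version of the query processing: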
// function to do index search by source
private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
    // build the phrase query from the analyzed tokens, not from the raw words
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    TokenStream tokenStream = new StandardAnalyzer().tokenStream("source", source);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        builder.add(new Term("source", charTermAttribute.toString()));
    }
    tokenStream.end();
    tokenStream.close();
    // tolerate the position gaps left by removed stop words
    builder.setSlop(2);
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
}
For simplicity, you can remove stop words from StandardAnalyzer altogether by using the constructor that takes an empty stop word set, at both index and query time. Then everything behaves as you expected. You can read more about stop words and phrase queries here.
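As a minimal sketch of that option, assuming Lucene 7.x (where CharArraySet lives in org.apache.lucene.analysis), an analyzer without stop words could be created in one place and shared by the indexing and the query code. The class and method names (NoStopWordsAnalyzer, create) are made up for illustration and are not part of Lucene.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Hypothetical helper, not part of Lucene.
public final class NoStopWordsAnalyzer {
    private NoStopWordsAnalyzer() {}

    // StandardAnalyzer with an empty stop set: "of", "an", "the" and so on
    // are indexed like any other term, so phrases contain no position gaps.
    public static Analyzer create() {
        return new StandardAnalyzer(CharArraySet.EMPTY_SET);
    }
}
```

The same analyzer would have to go into IndexWriterConfig and into the query-side tokenStream call, and an index built with the default stop set would need to be rebuilt.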
All the problems with phrase queries start with stop words. Under the hood, Lucene stores the position of every indexed term in a separate structure, the term positions, and a removed stop word still takes up a position. This is useful in some cases, for example to tell "the goal" apart from "goal". A phrase query takes these term positions into account. For example, take the text "black and white", where "and" is a stop word. The index will contain two terms, "black" at position 1 and "white" at position 3. A naive phrase query "black white" matches nothing, because it does not allow a gap between term positions. There are two possible strategies to create the right query: allow some slop, or put the terms at their correct positions in the phrase builder.
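The gap can be illustrated with a toy model in plain Java. This is only a sketch and not how Lucene is implemented: PositionGapDemo, index and matches are invented names, tokenization is a crude split on non-word characters, and positions are counted from 0 here, so "black" lands at 0 and "white" at 2.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: this is NOT Lucene code.
public class PositionGapDemo {

    // Record the position of every non-stop word. Stop words are dropped,
    // but they still consume a position, which leaves a gap.
    public static Map<String, List<Integer>> index(String text, Set<String> stopWords) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (!stopWords.contains(token)) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
        return postings;
    }

    // Exact phrase match (no slop): each term must occur at the phrase start
    // plus the offset it was given in the query.
    public static boolean matches(Map<String, List<Integer>> postings, String[] terms, int[] offsets) {
        List<Integer> starts = postings.getOrDefault(terms[0], Collections.emptyList());
        for (int start : starts) {
            boolean all = true;
            for (int i = 1; i < terms.length; i++) {
                int wanted = start - offsets[0] + offsets[i];
                if (!postings.getOrDefault(terms[i], Collections.emptyList()).contains(wanted)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = index("black and white", Collections.singleton("and"));
        String[] terms = {"black", "white"};
        System.out.println(matches(postings, terms, new int[] {0, 1})); // naive adjacent positions: false
        System.out.println(matches(postings, terms, new int[] {0, 2})); // gap preserved: true
    }
}
```

The naive query fails only because it claims the two terms are adjacent; once the query carries the same gap as the index, the exact match succeeds without any slop.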
To build the right query, you can read the following attribute from the token stream during query processing:
PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
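One way this attribute might be used is sketched below, assuming Lucene 7.x. Each analyzed term is added to the phrase builder at the position the analyzer reports, so the gaps left by removed stop words are reproduced in the query. PositionAwarePhrase and build are hypothetical names, not part of Lucene.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Hypothetical helper, not part of Lucene.
public final class PositionAwarePhrase {
    private PositionAwarePhrase() {}

    // Analyze the text with the index-time analyzer and give every term the
    // absolute position derived from its position increment.
    public static PhraseQuery build(Analyzer analyzer, String field, String text) throws IOException {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);
            stream.reset();
            int position = -1; // a first increment of 1 yields position 0
            while (stream.incrementToken()) {
                // an increment greater than 1 means stop words were skipped
                position += posInc.getPositionIncrement();
                builder.add(new Term(field, term.toString()), position);
            }
            stream.end();
        }
        return builder.build();
    }
}
```

In the question's searchBySource, the query could then be created with PositionAwarePhrase.build(new StandardAnalyzer(), "source", source), with no slop needed, provided the analyzer is configured the same way as the one used for indexing.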
I've used setSlop(2) to keep the code snippet simple; you can instead set the slop based on the query length, or put the correct term positions into the phrase builder. My recommendation is not to use stop words at all; you can read about stop words here.