Reputation: 11
I have an application that requires me to index a few gigabytes of sentences (about 16 million lines).
Currently my search works in the following way.
My search terms usually revolve around a phrase, for example "running in the park". I want to find sentences that are similar to this phrase or that contain part of it. I do so by constructing smaller phrases:
"running in the", "in the park", etc.
Each of them is given a weight (the longer ones get a larger weight).
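To make the sub-phrase construction above concrete, here is a minimal sketch in plain Java. The class name SubPhrases, the method weighted, the minWords cutoff, and the choice of "weight = number of words" are all assumptions made for illustration; the question only says that longer phrases get a larger weight.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds weighted sub-phrases (contiguous word runs) from a search phrase. */
class SubPhrases {

    /**
     * Returns every contiguous run of at least minWords words, longest first.
     * The weight is simply the word count (an assumed scheme, not the asker's).
     */
    static Map<String, Integer> weighted(String phrase, int minWords) {
        Map<String, Integer> result = new LinkedHashMap<>();
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return result; // nothing to split
        }
        String[] words = trimmed.split("\\s+");
        int shortest = Math.max(1, minWords);
        for (int len = words.length; len >= shortest; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                String sub = String.join(" ", Arrays.copyOfRange(words, start, start + len));
                result.put(sub, len);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the full phrase first, then the two three-word sub-phrases.
        for (Map.Entry<String, Integer> e : weighted("running in the park", 3).entrySet()) {
            System.out.println(e.getValue() + "  \"" + e.getKey() + "\"");
        }
    }
}
```

Each entry could then become one phrase clause of the overall query, with the weight applied as that clause's boost.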
At the moment, I treat each line as one document. A typical search takes a few seconds, and I am wondering whether there is a way to speed it up.
On top of that, I also need to know what was matched.
For example, "I was jogging in the park this morning" matches "in the park", and I want to know how it matched. I know about Explainer for Lucene search, but is there a simpler way, or a resource from which I can learn how to extract the information I want from Lucene's Explainer?
I am currently using regex to get the matched term. It is fast but inaccurate, because Lucene sometimes ignores punctuation and other characters, and I can't handle all the special cases.
Upvotes: 1
Views: 4166
Reputation: 2384
SpanQueries might help you find where in the sentence the query matched: https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/spans/package-summary.html
With them you can get the exact match positions for a query; see the related question "How to get the matching spans of a Span Term Query in Lucene 5?"
Upvotes: 0
Reputation: 159
Highlighter is a better choice than Explainer because it is faster. After highlighting, you can extract the matched phrases from between the tags; see the related question "Java regex to extract text between tags".
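import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;

// Written against the Lucene 3.6 API (see Version.LUCENE_36 below).
public class HighlightDemo {
    Directory directory;
    Analyzer analyzer;
    String[] contents = {"running in the park",
            "I was jogging in the park this morning",
            "running on the road",
            "The famous New York Marathon has its final miles in Central park every year and it's easy to understand why: the park, with a variety of terrain and excellent scenery, is the ultimate runner's dream. With its many paths that range in level of difficulty, Central Park allows a runner to experience clarity and freedom in this picturesque urban oasis."};

    @Before
    public void setUp() throws IOException {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();
        // Index one document per sentence.
        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++) {
            Document doc = new Document();
            doc.add(new Field("content", contents[i], Field.Store.NO, Field.Index.ANALYZED)); // indexed, not stored
            doc.add(new NumericField("id", Field.Store.YES, true).setIntValue(i)); // stored and indexed
            writer.addDocument(doc);
        }
        writer.close();
    }

    @Test
    public void test() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexSearcher s = new IndexSearcher(directory);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
        Query query = parser.parse("park");
        TopDocs hits = s.search(query, 10);
        // The default SimpleHTMLFormatter wraps each matched term in <B>...</B>.
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int id = hits.scoreDocs[i].doc;
            Document doc = s.doc(id);
            // "content" is not stored, so look the original text up by the stored id.
            String text = contents[Integer.parseInt(doc.get("id"))];
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    assertTrue(frag[j].toString().contains("<B>"));
                    assertTrue(frag[j].toString().contains("</B>"));
                    System.out.println(frag[j].toString());
                }
            }
        }
    }
}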
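For the extraction step mentioned at the top of this answer, a minimal sketch with java.util.regex could look like the following. It assumes the fragments use the <B>...</B> markers that the demo asserts on; the class name HighlightedTerms and the method extract are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pulls the highlighted terms out of a fragment marked up with <B>...</B>. */
class HighlightedTerms {

    // The reluctant quantifier (.*?) keeps each match inside a single tag pair.
    private static final Pattern TAGGED = Pattern.compile("<B>(.*?)</B>");

    /** Returns the text between each <B> and </B> pair, in order of appearance. */
    static List<String> extract(String fragment) {
        List<String> terms = new ArrayList<>();
        Matcher m = TAGGED.matcher(fragment);
        while (m.find()) {
            terms.add(m.group(1));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [park]
        System.out.println(extract("I was jogging in the <B>park</B> this morning"));
    }
}
```

This only works if the indexed text cannot itself contain the marker strings; if it can, passing more distinctive pre and post tags to the formatter would be safer. If the highlighter tags each token of a multi-word match separately, adjacent entries in the returned list would still need to be merged to recover the whole phrase.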
Upvotes: 3
Reputation: 5693
Lucene's "contrib" module Highlighter will let you extract what was matched by Lucene.
Upvotes: 2