How do I get a list of found words using Lucene.Net?

Question

I have indexed documents. They have content:

Document 1:

Green table stood in the room. The room was small.

Document 2:

Green tables stood in the room. The room was large.

I'm searching for "green table", which finds both Document 1 and Document 2. I want to show which phrases were matched: in the first document, "green table"; in the second document, "green tables". How can I get the list of matched words ("green table" and "green tables")? I'm using Lucene.Net version 3.0.3.

Omri · Accepted Answer

You can use the Highlighter to mark the matched words. If you need them for another purpose, you can still run the Highlighter and then use a regex (or a simple substring loop) to extract the words from the highlighted snippet.

For example:

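// Note: a TermQuery matches a single indexed term; for multi-word input you would typically build the query with a QueryParser.
Query objQuery = new TermQuery(new Term("content", strQuery));

QueryScorer scorer = new QueryScorer(objQuery, "content");

// Wrap each hit in marker tags so it can be located in the snippet afterwards.
// The <b> tags are an assumption; any distinctive pair of markers works.
const string openTag = "<b>";
const string closeTag = "</b>";
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(openTag, closeTag);

Highlighter highlighter = new Highlighter(formatter, scorer);
// A large fragment size keeps the whole field in a single fragment.
highlighter.TextFragmenter = new SimpleFragmenter(9999);

for (int i = 0; i < topRealtedDocs.ScoreDocs.Length; i++)
{
     int docId = topRealtedDocs.ScoreDocs[i].Doc;
     Document doc = searcher.Doc(docId);
     string content = doc.Get("content");

     TokenStream stream = TokenSources.GetAnyTokenStream(searcher.IndexReader, docId, "content", analyzer);

     // GetBestFragment returns null when nothing in the text matches the query.
     string strSnippet = highlighter.GetBestFragment(stream, content) ?? "";

     // Do what you want with the snippet here: add it to your result, or extract the marked words (a simple substring loop is shown; use a regex or whatever suits you):
     List<string> foundPhrases = new List<string>();
     int indexStart;
     while ((indexStart = strSnippet.IndexOf(openTag)) > -1)
     {
          // Skip past the opening tag so only the matched text is captured.
          indexStart += openTag.Length;
          int indexEnd = strSnippet.IndexOf(closeTag, indexStart);
          if (indexEnd < 0) break;

          foundPhrases.Add(strSnippet.Substring(indexStart, indexEnd - indexStart));

          // Continue after the closing tag; otherwise the loop would never advance.
          strSnippet = strSnippet.Substring(indexEnd + closeTag.Length);
     }
}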
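
The regex route mentioned above can be sketched as follows. This is only an illustration, not part of Lucene.Net: the class name HighlightExtractor and the method ExtractPhrases are hypothetical, and it assumes the formatter wraps every hit in a distinctive tag pair such as <b> and </b>. Because the Highlighter marks each matching term separately, the sketch also merges hits that are separated only by whitespace, so that adjacent terms come back as one phrase.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not a Lucene.Net type): pulls marked hits out of a highlighted snippet.
public static class HighlightExtractor
{
    // Returns the text between each openTag/closeTag pair.
    // Hits separated only by whitespace are merged into a single phrase.
    public static List<string> ExtractPhrases(string snippet, string openTag, string closeTag)
    {
        var phrases = new List<string>();
        if (string.IsNullOrEmpty(snippet)) return phrases;

        // Lazy match so each tag pair is captured on its own.
        string pattern = Regex.Escape(openTag) + "(.*?)" + Regex.Escape(closeTag);
        int previousEnd = -1;

        foreach (Match m in Regex.Matches(snippet, pattern, RegexOptions.Singleline))
        {
            string term = m.Groups[1].Value;

            // Adjacent means nothing but whitespace lies between the previous hit and this one.
            bool adjacent = previousEnd >= 0
                && snippet.Substring(previousEnd, m.Index - previousEnd).Trim().Length == 0;

            if (adjacent && phrases.Count > 0)
                phrases[phrases.Count - 1] += " " + term;
            else
                phrases.Add(term);

            previousEnd = m.Index + m.Length;
        }
        return phrases;
    }
}
```

For a snippet such as "<b>Green</b> <b>tables</b> stood in the room.", calling HighlightExtractor.ExtractPhrases(snippet, "<b>", "</b>") yields a single entry, "Green tables". Whether "tables" is highlighted at all for the query "green table" depends on the analyzer (for instance, one that applies stemming), which this sketch does not address.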

