Reputation: 31
I am trying to highlight terms in a string. My code searches along a string and looks for equivalent terms in an index. The code returns found terms ok. However, I would like to return the original string, to the user, that was inputted by the user with found terms highlighted. I am using Lucene 4 because that is the book I am using to learn Lucene. I have a pitiful attempt to get term vectors and such but it iterates through the entire field, I can't figure out how to just get the found terms.. Here is my code:
public class TokenArrayTest {
private static final String INDEX_DIR = "C:/ontologies/Lucene/icnpIndex";
//private static List<Float> levScore = new ArrayList<Float>();
//add key and value pairs of tokens to a map to send to a servlet. key 10,11,12 etc
//private static HashMap<Integer, String> hashMap = new HashMap<Integer, String>();
private static List<String> tokens = new ArrayList<String>();
private static int totalResults=0;
public static void main(String[] pArgs) throws IOException, ParseException, InvalidTokenOffsetsException
{
//counters which detect found term changes to advance the html table to the next cell
int b=1;
int c=1;
String searchText="Mrs. smith has limited mobility and fell out of bed. She needs a feeding assessment. She complained of abdominal pains nuring the night. She woke with a headache and she is due for a shower this morning.";
//Get directory reference
Directory dir = FSDirectory.open(new File(INDEX_DIR));
//Index reader - an interface for accessing a point-in-time view of a lucene index
IndexReader reader = DirectoryReader.open(dir);
//Create lucene searcher. It search over a single IndexReader.
IndexSearcher searcher = new IndexSearcher(reader);
//analyzer with the default stop words
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(searchText));
CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
//Query parser to be used for creating TermQuery
QueryParser qp = new QueryParser(Version.LUCENE_40, "Preferred Term", analyzer);
/*add all of the words to an array after they have passed through the analyzer.
* The words are used one by one through the query method later on.
*/
while (tokenStream.incrementToken()) {
tokens.add(termAttribute.toString());
}
//print the top half of the html page
System.out.print("<html>\r\n" +
"\r\n" +
"<head>\r\n" +
"<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">\r\n" +
"\r\n" +
"<title>ICNP results</title>\r\n" +
"</head>\r\n" +
"\r\n" +
"<body>\r\n" +
"\r\n" +
"<p>"+
searchText+"<br>"+
"<p>"+
"<div align=\"center\">\r\n" +
" <center>\r\n" +
" <table border=\"1\" \r\n" +
" <tr>\r\n" +
"<td>\r\n"+
"");
//place each word from the previous array into the query
for(int n=0;n<tokens.size();++n) {
//Create the query
Query query = qp.parse(tokens.get(n));
//Search the lucene documents for the hits
TopDocs hits = searcher.search(query, 20);
//Total found documents
totalResults =totalResults+hits.totalHits;
//print out the score for each searched term
//for (ScoreDoc sd : hits.scoreDocs)
//{
//Document d = searcher.doc(sd.doc);
// System.out.println("Score : " + sd.score);
// }
/** Highlighter Code Start ****/
//Put a html code in here for each found term if need be
Formatter formatter = new SimpleHTMLFormatter("", "");
//Scores text fragments by the number of unique query terms found
QueryScorer scorer = new QueryScorer(query);
//used to markup highlighted terms found in the best sections of a text
Highlighter highlighter = new Highlighter(formatter, scorer);
//It breaks text up into same-size texts but does not split up spans
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 20);
//set fragmenter to highlighter
highlighter.setTextFragmenter(fragmenter);
//Iterate over found results
for (int i = 0; i < hits.scoreDocs.length; i++)
{
int docid = hits.scoreDocs[i].doc;
Document doc = searcher.doc(docid);
//Get stored text from found document
String text = doc.get("Preferred Term");
//a pitiful attempt to get term vectors and such like
termsVector = reader.getTermVector(i, "Preferred Term");
termsEnum = termsVector.iterator(termsEnum);
while ( (term = termsEnum.next()) != null ) {
val = term.utf8ToString();
System.out.println("DocId: " + i);
System.out.println(" term: " + val);
System.out.println(" length: " + term.length);
docsAndPositionsEnum = termsEnum.docsAndPositions(null, docsAndPositionsEnum);
if (docsAndPositionsEnum.nextDoc() >= 0) {
int freq = docsAndPositionsEnum.freq();
System.out.println(" freq: " + docsAndPositionsEnum.freq());
for (int j = 0; j < freq; j++) {
System.out.println(" [");
System.out.println(" position: " + docsAndPositionsEnum.nextPosition());
System.out.println(" offset start: " + docsAndPositionsEnum.startOffset());
System.out.println(" offset end: " + docsAndPositionsEnum.endOffset());
System.out.println(" ]");
}
}
}
//Create token stream
TokenStream stream = TokenSources.getAnyTokenStream(reader, docid, "Preferred Term", analyzer);
//Get highlighted text fragments
String[] frags = highlighter.getBestFragments(stream, text,20);
for (String frag : frags)
{
//On the first pass print this html out
if((c==1)&&(b!=c)) {
System.out.println("<select>");
c=b;
}else if((b!=c)) { //and every other time move to the next cell when b changes
System.out.println("</select>"
+ "</td><td>"
+ "<select>");
c=b;
}
System.out.println("<option value='"+frag+"'>"+frag+"</option>");
}
}
b=b+1;
}
dir.close();
b=1;
c=1;
totalResults=0;
//print the bottom half of the html page
System.out.print("</select></td>\r\n" +
" </tr>\r\n" +
" </table>\r\n" +
" </center>\r\n" +
"</div>\r\n" +
"\r\n" +
"</body>\r\n" +
"\r\n" +
"</html>\r\n" +
"");
}
}
Upvotes: 1
Views: 635
Reputation: 31
For future readers, here is my Plain old java method (POJM) that prints out offsets.
generatePreviewText( analyzer, searchText, tokens, frags );
public static void generatePreviewText(Analyzer analyzer, String inputText, List<String> tokens, String[] frags) throws IOException
{
String contents[]= {inputText};
String[] foundTerms = frags;
//for(int n=0;n<frags.length;++n) {
//System.out.println("Found terms array= "+foundTerms[n]);
// }
Directory directory = new RAMDirectory();
IndexWriterConfig config =
new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
FieldType textFieldType = new FieldType();
textFieldType.setIndexed(true);
textFieldType.setTokenized(true);
textFieldType.setStored(true);
textFieldType.setStoreTermVectors(true);
textFieldType.setStoreTermVectorPositions(true);
textFieldType.setStoreTermVectorOffsets(true);
Document doc = new Document();
Field textField = new Field("content", "", textFieldType);
for (String content : contents) {
textField.setStringValue(content);
doc.removeField("content");
doc.add(textField);
indexWriter.addDocument(doc);
}
indexWriter.commit();
IndexReader indexReader = DirectoryReader.open(directory);
DocsAndPositionsEnum docsAndPositionsEnum = null;
Terms termsVector = null;
TermsEnum termsEnum = null;
BytesRef term = null;
String val = null;
for (int i = 0; i < indexReader.maxDoc(); i++) {
termsVector = indexReader.getTermVector(i, "content");
termsEnum = termsVector.iterator(termsEnum);
while ( (term = termsEnum.next()) != null ) {
val = term.utf8ToString();
// if(foundTerms.get(i)==val) {
System.out.println(" term: " + val);
System.out.println(" length: " + term.length);
docsAndPositionsEnum = termsEnum.docsAndPositions(null, docsAndPositionsEnum);
if (docsAndPositionsEnum.nextDoc() >= 0) {
int freq = docsAndPositionsEnum.freq();
System.out.println(" freq: " + docsAndPositionsEnum.freq());
for (int j = 0; j < freq; j++) {
System.out.println(" [");
System.out.println(" position: " + docsAndPositionsEnum.nextPosition());
System.out.println(" offset start: " + docsAndPositionsEnum.startOffset());
System.out.println(" offset end: " + docsAndPositionsEnum.endOffset());
System.out.println(" ]");
}
}
//} }
}indexWriter.close();
}
Upvotes: 0
Reputation: 763
I Don't know if possible with lucene v4 but with newer versions it's easy possible with a Highlighter
a UnifiedHighlighter
.
There are several tutorials in which text highlighting is achieved on different ways (just google it...):
If you start with a new project i would strongly suggest using the most recent version even if your book is based on lucene v4. The book is good to get a basic understanding about how lucene works but using an old version of the library is an instant technical dept which you habe to deal later on. Additional to this a newer version usually provides additional features which may be interesting for you.
Upvotes: 1