Reputation: 363
I am using Lucene.Net 2.9.2 and I reckon I will need to write a custom tokenizer but wanted to check in case I am missing something obvious.
The document consists of Title, Keywords and Content, plus some metadata such as author, date, etc., each stored as a field. The documents are software technical documents and may contain phrases such as '.Net', 'C++' and 'C#' in the title, keywords and/or content.
I'm using the KeywordAnalyzer for the Keyword field and StandardAnalyzer for Title and Content - stop-word removal, lower-casing etc. are necessary as the documents can be very long.
I have also written a custom synonym filter for search: when searching for, say, 'C#', I also want to recognise 'CSharp', 'C#.Net', etc. The problem is that the tokenizer has already stripped the '#' from 'C#' and the '++' from 'C++', so those tokens can be confused with, say, a reference to the C language.
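To illustrate the problem, here is a small sketch that runs StandardAnalyzer over such a phrase (assuming the Lucene.Net 2.9 StandardAnalyzer(Version) constructor and the attribute-based TokenStream API; the field name "Content" is just illustrative):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

class StandardAnalyzerDemo
{
    static void Main()
    {
        // StandardAnalyzer drops '#' and '++' during tokenization and
        // lower-cases the result, so 'C#' and 'C++' both collapse to
        // the ambiguous token "c".
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        TokenStream ts = analyzer.TokenStream("Content", new StringReader("C# and C++"));
        var term = (TermAttribute)ts.GetAttribute(typeof(TermAttribute));
        while (ts.IncrementToken())
            Console.WriteLine(term.Term());
    }
}
```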
My thought is that when I index Title and Content I need to branch the tokenization depending on whether the current token is part of one of the keyword phrases or any of their synonyms.
Is that the best approach? Many thanks in advance :)
Upvotes: 1
Views: 2758
Reputation: 1024
A custom tokenizer can be written by subclassing one of the following classes:
1. Lucene.Net.Analysis.CharTokenizer
2. Lucene.Net.Analysis.Tokenizer
public class AlphaNumericTokenizer : Lucene.Net.Analysis.CharTokenizer
{
    public AlphaNumericTokenizer(System.IO.TextReader input) : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Return true for characters that form part of a token;
        // any other character acts as a token separator.
        return char.IsLetterOrDigit(c);
    }
}
Please refer to http://karticles.com/NoSql/lucene_custom_tokenizer.html
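For the '.Net' / 'C++' / 'C#' case in the question, IsTokenChar could be widened to keep the relevant punctuation, and the tokenizer wrapped in an analyzer. A minimal sketch, assuming the Lucene.Net 2.9 CharTokenizer and Analyzer APIs (the class names here are illustrative, not part of Lucene.Net):

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical variant that also keeps '#', '+' and '.', so tokens
// like "C#", "C++" and ".Net" survive tokenization intact.
public class TechTermTokenizer : CharTokenizer
{
    public TechTermTokenizer(TextReader input) : base(input) { }

    protected override bool IsTokenChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '#' || c == '+' || c == '.';
    }
}

// Minimal analyzer wrapping the tokenizer, with lower-casing on top.
public class TechTermAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseFilter(new TechTermTokenizer(reader));
    }
}
```

One caveat: treating '.' as a token character means sentence-ending periods stay attached to the preceding word, so a follow-on filter that trims trailing punctuation from tokens that are not protected terms would still be needed.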
Upvotes: 2
Reputation: 5354
I think you can use WhitespaceTokenizer, then plug in a KeywordMarkerFilter to mark some tokens as 'inviolable', and finally supply your own filter that strips punctuation characters. Maybe someone with knowledge of Lucene.Net will suggest something more specific; in Solr, for example, WordDelimiterFilter could be used.
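Since KeywordMarkerFilter may not be available in Lucene.Net 2.9, the same idea can be folded into a single custom TokenFilter that checks a protected-terms set itself. A sketch, assuming the Lucene.Net 2.9 TokenFilter/TermAttribute API (the class name and the choice of protected terms are illustrative):

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Strips punctuation from tokens unless the token is in a protected
// set, so terms like "c#" and "c++" pass through untouched.
public class ProtectedTermFilter : TokenFilter
{
    private readonly HashSet<string> _protectedTerms;
    private readonly TermAttribute _termAttr;

    public ProtectedTermFilter(TokenStream input, HashSet<string> protectedTerms)
        : base(input)
    {
        _protectedTerms = protectedTerms;
        _termAttr = (TermAttribute)AddAttribute(typeof(TermAttribute));
    }

    public override bool IncrementToken()
    {
        if (!input.IncrementToken())
            return false;

        string term = _termAttr.Term();
        if (!_protectedTerms.Contains(term))
        {
            // Trim leading/trailing punctuation from unprotected tokens.
            string stripped = term.Trim('#', '+', '.', ',', ';', ':');
            if (stripped != term)
                _termAttr.SetTermBuffer(stripped);
        }
        return true;
    }
}
```

The filter would sit at the end of a chain such as WhitespaceTokenizer → LowerCaseFilter → ProtectedTermFilter; note the protected terms must then be stored lower-cased to match the tokens they are compared against.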
Upvotes: 1