Reputation: 363
I am using Lucene.Net 2.9.2 and I reckon I will need to write a custom tokenizer but wanted to check in case I am missing something obvious.
The document consists of Title, Keywords and Content, plus some metadata such as author, date, etc., each stored as a field. The documents are software technical documents and may contain phrases such as '.Net', 'C++' and 'C#' in the title, keywords and/or content.
I'm using the KeywordAnalyzer for the Keyword field and StandardAnalyzer for Title and Content - stop-word removal, lower-casing etc. are necessary as the documents can be very long.
I have also written a custom synonym filter for search: when searching for, say, 'C#', I also want to recognise 'CSharp', 'C#.Net', etc. The problem is that the tokenizer has already stripped the '#' from 'C#' and the '++' from 'C++', so those tokens can be confused with, say, a reference to the C language.
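To illustrate the problem, here is a small sketch that runs StandardAnalyzer over such a phrase (assuming the Lucene.Net 2.9 StandardAnalyzer(Version) constructor and the attribute-based TokenStream API; the field name "Content" is just illustrative):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

class StandardAnalyzerDemo
{
    static void Main()
    {
        // StandardAnalyzer drops '#' and '++' during tokenization and
        // lower-cases the result, so 'C#' and 'C++' both collapse to
        // the ambiguous token "c".
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        TokenStream ts = analyzer.TokenStream("Content", new StringReader("C# and C++"));
        var term = (TermAttribute)ts.GetAttribute(typeof(TermAttribute));
        while (ts.IncrementToken())
            Console.WriteLine(term.Term());
    }
}
```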
My thought is that when I index Title and Content I need to branch the tokenization depending on whether the current token is part of one of the keyword phrases or any of their synonyms.
Is that the best approach? Many thanks in advance :)
Upvotes: 1
Views: 2758
Reputation: 1024
A custom tokenizer can be written by subclassing one of the following classes:
1. Lucene.Net.Analysis.CharTokenizer
2. Lucene.Net.Analysis.Tokenizer
public class AlphaNumericTokenizer : Lucene.Net.Analysis.CharTokenizer
{
    public AlphaNumericTokenizer(System.IO.TextReader input) : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Return true for characters that form part of a token;
        // any other character acts as a token separator.
        return char.IsLetterOrDigit(c);
    }
}
Please refer to http://karticles.com/NoSql/lucene_custom_tokenizer.html
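For the '.Net' / 'C++' / 'C#' case in the question, IsTokenChar could be widened to keep the relevant punctuation, and the tokenizer wrapped in an analyzer. A minimal sketch, assuming the Lucene.Net 2.9 CharTokenizer and Analyzer APIs (the class names here are illustrative, not part of Lucene.Net):

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical variant that also keeps '#', '+' and '.', so tokens
// like "C#", "C++" and ".Net" survive tokenization intact.
public class TechTermTokenizer : CharTokenizer
{
    public TechTermTokenizer(TextReader input) : base(input) { }

    protected override bool IsTokenChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '#' || c == '+' || c == '.';
    }
}

// Minimal analyzer wrapping the tokenizer, with lower-casing on top.
public class TechTermAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseFilter(new TechTermTokenizer(reader));
    }
}
```

One caveat: treating '.' as a token character means sentence-ending periods stay attached to the preceding word, so a follow-on filter that trims trailing punctuation from tokens that are not protected terms would still be needed.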
Upvotes: 2
Reputation: 5354
I think you can use WhitespaceTokenizer, then plug in a KeywordMarkerFilter to mark some tokens as 'inviolable', and finally supply your own filter that strips punctuation characters. Maybe someone with knowledge of Lucene.Net will suggest something more specific; in Solr, for example, WordDelimiterFilter could be used.
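Since KeywordMarkerFilter may not be available in Lucene.Net 2.9, the same idea can be folded into a single custom TokenFilter that checks a protected-terms set itself. A sketch, assuming the Lucene.Net 2.9 TokenFilter/TermAttribute API (the class name and the choice of protected terms are illustrative):

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Strips punctuation from tokens unless the token is in a protected
// set, so terms like "c#" and "c++" pass through untouched.
public class ProtectedTermFilter : TokenFilter
{
    private readonly HashSet<string> _protectedTerms;
    private readonly TermAttribute _termAttr;

    public ProtectedTermFilter(TokenStream input, HashSet<string> protectedTerms)
        : base(input)
    {
        _protectedTerms = protectedTerms;
        _termAttr = (TermAttribute)AddAttribute(typeof(TermAttribute));
    }

    public override bool IncrementToken()
    {
        if (!input.IncrementToken())
            return false;

        string term = _termAttr.Term();
        if (!_protectedTerms.Contains(term))
        {
            // Trim leading/trailing punctuation from unprotected tokens.
            string stripped = term.Trim('#', '+', '.', ',', ';', ':');
            if (stripped != term)
                _termAttr.SetTermBuffer(stripped);
        }
        return true;
    }
}
```

The filter would sit at the end of a chain such as WhitespaceTokenizer → LowerCaseFilter → ProtectedTermFilter; note the protected terms must then be stored lower-cased to match the tokens they are compared against.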
Upvotes: 1