eslsys

Reputation: 363

Tokenizing Keywords in Lucene.Net

I am using Lucene.Net 2.9.2 and suspect I will need to write a custom tokenizer, but I wanted to check in case I am missing something obvious.

Each document consists of Title, Keywords and Content, plus some metadata such as author and date, each stored as a field. These are software technical documents and may contain terms such as '.Net', 'C++' and 'C#' in the title, keywords and/or content.

I'm using KeywordAnalyzer for the Keywords field and StandardAnalyzer for Title and Content. Stop-word removal, lower-casing etc. are necessary as the documents can be very long.

I have also written a custom synonym filter for search, because I want a search for 'C#', for example, to also recognise 'CSharp', 'C#.Net' etc. The problem is that the tokenizer has already removed the '#' from 'C#' and the '++' from 'C++', so both can be confused with, say, a reference to the 'C' language.

My thought is that when I index Title and Content, I need to branch the tokenization depending on whether the current token is part of one of the keyword phrases or any of its synonyms.

Is that the best approach? Many thanks in advance :)

Upvotes: 1

Views: 2758

Answers (2)

vrluckyin

Reputation: 1024

A custom tokenizer can be written by deriving from one of the following classes:

1) Lucene.Net.Analysis.CharTokenizer
2) Lucene.Net.Analysis.Tokenizer

public class AlphaNumbericTokenizer : Lucene.Net.Analysis.CharTokenizer
{
     public AlphaNumbericTokenizer (System.IO.TextReader input) : base(input)
     {
     }
     protected override bool IsTokenChar(char c)
     {
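       //TODO: Decide which characters belong to a token; every other character acts as a separator.
       // As written, '#' and '+' are separators, so 'C#' and 'C++' are both reduced to 'C'.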
       return char.IsLetterOrDigit(c);
     }
}

Please refer to http://karticles.com/NoSql/lucene_custom_tokenizer.html
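
For the terms in the question, the IsTokenChar test would have to accept a few punctuation characters as well as letters and digits. Below is a minimal, library-free sketch of that idea. TechTermChars and Split are hypothetical names, and the loop only imitates how a CharTokenizer groups characters; it is not Lucene.Net code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TechTermChars
{
    // Hypothetical predicate: letters and digits, plus '#' and '+',
    // so that "C#" and "C++" survive as single tokens.
    public static bool IsTokenChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '#' || c == '+';
    }

    // Imitates the grouping a CharTokenizer performs: each maximal run of
    // token characters becomes one token, everything else is a separator.
    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

With this predicate, "C# and C++ in .Net" splits into C#, and, C++, in, Net. The leading '.' of '.Net' is still lost, and accepting '.' as a token character would glue sentence-ending full stops onto words, so that case probably needs separate handling.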

Upvotes: 2

Artur Nowak

Reputation: 5354

I think you could use WhitespaceTokenizer, then plug in a KeywordMarkerFilter to mark some tokens as 'inviolable', and finally supply your own filter that strips punctuation characters from the remaining tokens. Maybe someone with more knowledge of Lucene.Net will suggest something better; in Solr, for example, WordDelimiterFilter could be used.
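
The three steps can be sketched without any Lucene types, just to show the logic. This is an illustration under assumptions: ProtectedTermPipeline, Analyze and the list of protected terms are all hypothetical, and whether KeywordMarkerFilter is available in Lucene.Net 2.9.2 is not confirmed here, so in practice each step would likely become a custom TokenFilter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProtectedTermPipeline
{
    // Hypothetical list of terms that must be indexed unchanged (lower-cased).
    private static readonly HashSet<string> Protected =
        new HashSet<string>(StringComparer.Ordinal) { "c#", "c++", ".net", "c#.net" };

    // Sentence punctuation that may trail a token after whitespace splitting.
    private static readonly char[] Trailing = { '.', ',', ';', ':', '!', '?', ')' };

    public static List<string> Analyze(string text)
    {
        var result = new List<string>();

        // Step 1: split on whitespace only, so "C#" and "C++" stay in one piece.
        foreach (string raw in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string token = raw.ToLowerInvariant().TrimEnd(Trailing);

            // Step 2: protected terms pass through untouched.
            if (Protected.Contains(token))
            {
                result.Add(token);
                continue;
            }

            // Step 3: every other token has its punctuation stripped.
            string stripped = new string(token.Where(c => char.IsLetterOrDigit(c)).ToArray());
            if (stripped.Length > 0)
            {
                result.Add(stripped);
            }
        }
        return result;
    }
}
```

For the input "I use C#, C++ and .Net daily." this yields i, use, c#, c++, and, .net, daily, so 'c#' stays distinct from a plain 'c'. The same protected list would have to be applied at query time, otherwise a search for 'C#' would be analyzed differently from the indexed text.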

Upvotes: 1
