Reputation: 4226
I have a database table with about 40,000 records containing code fields, such as FLEFSU25B-25M and EMG1090-5S.
I need to be able to very quickly select all codes that contain a given substring. For example "109" matches EMG1090-5S.
My current approach is to store the codes in Lucene and have Lucene filter by substring, such as 109. But that is not very efficient if I just store the codes, because then Lucene has to search through all the tokens.
To overcome this, I'm thinking of creating a new analyzer that will split each code into tokens, like this:
EMG1090-5S
MG1090-5S
G1090-5S
1090-5S
...
Then to find all codes with substring 109, I can search on 109* which is much more efficient (I understand Lucene stores tokens alphabetically, just like SQL Server indexes).
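To make the idea concrete, here is a minimal sketch in plain Java with no Lucene involved. The class and method names are made up for illustration, and a sorted in-memory set only stands in for an index that keeps its terms in order. It shows that a substring match on a code is the same thing as a prefix match on one of its suffix tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only: not Lucene code.
class SuffixIndexSketch {

    // Every suffix of the code, longest first: EMG1090-5S, MG1090-5S, ...
    static List<String> suffixes(String code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.length(); i++) {
            out.add(code.substring(i));
        }
        return out;
    }

    // True if some token in the sorted set starts with the given prefix.
    // tailSet jumps straight to the first token >= prefix, so only that
    // one token needs to be inspected instead of scanning all of them.
    static boolean anyTokenStartsWith(SortedSet<String> tokens, String prefix) {
        SortedSet<String> tail = tokens.tailSet(prefix);
        return !tail.isEmpty() && tail.first().startsWith(prefix);
    }

    public static void main(String[] args) {
        SortedSet<String> tokens = new TreeSet<>(suffixes("EMG1090-5S"));
        System.out.println(anyTokenStartsWith(tokens, "109")); // true
        System.out.println(anyTokenStartsWith(tokens, "XYZ")); // false
    }
}
```

The trade-off is index size: a code of length n produces n tokens, which for 40,000 short codes should still be small.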
Does this make sense? Does such an analyzer already exist? I'm using .NET/C#.
Upvotes: 1
Views: 1645
Reputation: 33351
A token filter to accomplish this does indeed already exist! Take a look at EdgeNGramTokenFilter. An Analyzer using it might look something like:
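// Depending on the Lucene version, some of these constructors may also take a Version argument.
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Emit the whole code as a single token
        KeywordTokenizer source = new KeywordTokenizer(reader);
        // Lowercase it so that matching is case-insensitive
        TokenStream filter = new LowerCaseFilter(source);
        // Emit n-grams anchored at the end of the token, 2 to 50 characters long
        filter = new EdgeNGramTokenFilter(filter, EdgeNGramTokenFilter.Side.BACK, 2, 50);
        return new TokenStreamComponents(source, filter);
    }
};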
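To see which tokens a chain like that should end up indexing, here is a rough plain-Java imitation. It is only a sketch of the expected output, assuming the filter emits the trailing 2 to 50 characters of the lowercased code; it is not how Lucene implements it, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: mimics the expected tokens, not Lucene itself.
class BackEdgeNGramSketch {

    // Lowercase the whole code, then collect its suffixes whose length
    // lies between minGram and maxGram (capped at the code's length).
    static List<String> backEdgeNGrams(String code, int minGram, int maxGram) {
        String token = code.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(token.length() - len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints 5s, -5s, 0-5s, 90-5s, 090-5s, 1090-5s, g1090-5s, mg1090-5s, emg1090-5s
        for (String gram : backEdgeNGrams("EMG1090-5S", 2, 50)) {
            System.out.println(gram);
        }
    }
}
```

Because the tokens are lowercased, the query has to be lowercased as well before it is used as a prefix. In this sketch the minimum length of 2 also means the last character on its own never becomes a token.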
For your consideration, WordDelimiterFilter might also prove useful to you. It has a number of configuration options, and can be used to separate at punctuation and at transitions from letter to number, etc. So with it, from your input: "EMG1090-5S"
You could get the tokens: EMG, 1090, 5, S
Which might work well for your case, but would not be particularly helpful in finding something like: "MG1"
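The limitation is easy to see in a rough plain-Java approximation of that splitting. This uses a regular expression rather than the Lucene filter, assumes only the two rules mentioned above (split at punctuation and at letter/digit transitions), and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a regex stand-in, not the Lucene filter.
class WordDelimiterSketch {

    // Split at runs of non-alphanumeric characters and at every
    // letter-to-digit or digit-to-letter boundary.
    static List<String> split(String code) {
        String[] parts = code.split(
            "[^A-Za-z0-9]+|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])");
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // True if any token starts with the given prefix.
    static boolean anyStartsWith(List<String> tokens, String prefix) {
        for (String token : tokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = split("EMG1090-5S");
        System.out.println(tokens);                       // [EMG, 1090, 5, S]
        System.out.println(anyStartsWith(tokens, "109")); // true
        System.out.println(anyStartsWith(tokens, "MG1")); // false
    }
}
```

A query that starts at a token boundary, such as 109, is found, while one that straddles a boundary, such as MG1, is not.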
Upvotes: 1