Reputation: 1802
I've found Lucene to be fantastic so far, but I'm having a few issues replicating the equivalent of a LIKE search.
In the application I'm working on I need the option of a "simplified" (LIKE) search and an advanced (full-text) search. The data is user based (name, location, etc.), so not huge reams of text.
In the past I'd simply create a SQL query that concatenated the db field names and surrounded the terms with wildcards. I could do that in my application, bypassing Lucene for simple searches of the user data, but it would be nice to use Lucene.
I've tried regex searches:
var escaped = QueryParser.Escape(_query);
var search = new RegexQuery(new Term("name", string.Concat(".*", escaped, ".*")));
but they only work on a single field.
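If the aim is the same "contains" match across several fields, one option is to build one RegexQuery per field and OR them together in a BooleanQuery. A minimal sketch, assuming Lucene.NET 3.x (where the clause type is Occur.SHOULD); the field names are just examples:
var fields = new[] { "name", "location" };
var combined = new BooleanQuery();
foreach (var field in fields)
{
    // SHOULD means the term may match in any one of the fields
    combined.Add(new RegexQuery(new Term(field, string.Concat(".*", escaped, ".*"))), Occur.SHOULD);
}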
One idea I had was to tokenise each field to produce something similar to a full-text search, e.g. for:
name: Paul
I create the following name tokens...
Paul Pau Pa aul ul au
Would this defeat the point of using Lucene over a LIKE SQL search? Would it actually produce the results I want?
What would be the best way to solve this issue?
Edit:
Slightly modifying the code in this question:
Elegant way to split string into 2 strings on word boundaries to minimize length difference
to produce this tokeniser:
// requires using System.Collections.Generic;
private IEnumerable<string> Tokeniser(string _item)
{
    string s = _item;
    const int maxPrefixLength = 10;
    const int maxSuffixLength = 10;
    const int minStemLength = 1;
    var tokens = new List<string>();
    for (int prefixLength = 0; (prefixLength + minStemLength <= s.Length) && (prefixLength <= maxPrefixLength); prefixLength++)
    {
        for (int suffixLength = 0; (suffixLength + prefixLength + minStemLength <= s.Length) && (suffixLength <= maxSuffixLength); suffixLength++)
        {
            // Split the string into prefix + stem + suffix and emit each part
            string prefix = s.Substring(0, prefixLength);
            string suffix = s.Substring(s.Length - suffixLength);
            string stem = s.Substring(prefixLength, s.Length - suffixLength - prefixLength);
            // Only keep substrings of two or more characters, skipping duplicates
            if (prefix.Length > 1 && !tokens.Contains(prefix))
                tokens.Add(prefix);
            if (suffix.Length > 1 && !tokens.Contains(suffix))
                tokens.Add(suffix);
            if (stem.Length > 1 && !tokens.Contains(stem))
                tokens.Add(stem);
        }
    }
    return tokens;
}
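For context, a hypothetical indexing step consuming those tokens might look like this (a sketch against the Lucene.NET 3.x field API; the user and writer objects are assumptions):
var doc = new Document();
// Each substring becomes an unanalysed value of the "name" field, so a
// plain TermQuery on "name" then behaves like SQL LIKE '%term%'.
foreach (var token in Tokeniser(user.Name))
    doc.Add(new Field("name", token, Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.AddDocument(doc);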
The search results do give the equivalent of a LIKE search. My "user" table will only ever hold around 9,000 entities, so for me at least this might fit my needs.
Are there any downsides to doing this (other than a much larger Lucene index)?
Upvotes: 1
Views: 590
Reputation: 8553
Character-based n-grams (NGramTokenizer, NGramTokenFilter, EdgeNGramTokenizer and EdgeNGramTokenFilter) should provide the functionality you need.
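For example, a minimal analyzer built around NGramTokenFilter might look like the sketch below, assuming Lucene.NET 3.x with the contrib Lucene.Net.Analysis.NGram package; the gram sizes 2 and 10 are illustrative:
// using System.IO;
// using Lucene.Net.Analysis;
// using Lucene.Net.Analysis.NGram;
// using Lucene.Net.Analysis.Standard;
public class NGramAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenise and lower-case, then emit every 2..10 character n-gram
        // of each token so a plain term search can match inside words.
        TokenStream stream = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        return new NGramTokenFilter(stream, 2, 10);
    }
}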
Upvotes: 1