Chris W

Reputation: 1802

Lucene.net parse Fields to create a LIKE equivalent search

I've found Lucene to be fantastic so far, but I'm having a few issues with duplicating a LIKE equivalent search.

In an application I'm working on I need the option of a "simplified" (LIKE) search and an advanced (full-text) search. The data is user-based (name, location, etc.), so not huge reams of text.

In the past I'd simply write a SQL query that concatenated the db field names and surrounded the search term with wildcards. I could do that in my application, bypassing Lucene for simple searches of the user data, but it would be nice to use Lucene.

I've tried regex searches

// Escape the input first, then use the escaped form in the pattern
var query = QueryParser.Escape(_query);
var search = new RegexQuery(new Term("name", string.Concat(".*", query, ".*")));

but a RegexQuery only works on a single field.
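One way around the single-field limit (a sketch, untested against your schema — it assumes Lucene.Net 3.x with the Contrib.Regex package, and the field names "name" and "location" are placeholders for your own) is to OR one RegexQuery per field into a BooleanQuery:

```csharp
// Sketch: OR one RegexQuery per field so the pattern can match any of them.
// Field names here are assumptions - substitute your own index fields.
var pattern = string.Concat(".*", QueryParser.Escape(_query), ".*");
var combined = new BooleanQuery();
foreach (var field in new[] { "name", "location" })
{
    combined.Add(new RegexQuery(new Term(field, pattern)), Occur.SHOULD);
}
// combined can now be handed to searcher.Search(...) like any other query
```

Leading-wildcard patterns like this still have to scan the whole term dictionary, so it behaves much like LIKE '%term%' performance-wise.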

One idea I had was to tokenise each field to produce something similar to a full-text search, e.g.:

name: Paul

so I create the following name fields...

Paul Pau Pa aul ul au

Would this defeat the point of using lucene over a LIKE SQL search? Would it actually produce the results I want?

What would be the best way to solve this issue?

Edit:

Slightly modifying the code in this question:

Elegant way to split string into 2 strings on word boundaries to minimize length difference

to produce this tokeniser:

    private IEnumerable<string> Tokeniser(string _item)
    {
        const int maxPrefixLength = 10;
        const int maxSuffixLength = 10;
        const int minStemLength = 1;

        var tokens = new List<string>();

        for (int prefixLength = 0; (prefixLength + minStemLength <= _item.Length) && (prefixLength <= maxPrefixLength); prefixLength++)
            for (int suffixLength = 0; (suffixLength + prefixLength + minStemLength <= _item.Length) && (suffixLength <= maxSuffixLength); suffixLength++)
            {
                // Split the string into prefix + stem + suffix and keep each
                // part (longer than one character) as a token
                string prefix = _item.Substring(0, prefixLength);
                string suffix = _item.Substring(_item.Length - suffixLength);
                string stem = _item.Substring(prefixLength, _item.Length - suffixLength - prefixLength);

                if (prefix.Length > 1 && !tokens.Contains(prefix))
                    tokens.Add(prefix);

                if (suffix.Length > 1 && !tokens.Contains(suffix))
                    tokens.Add(suffix);

                if (stem.Length > 1 && !tokens.Contains(stem))
                    tokens.Add(stem);
            }

        return tokens;
    }
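For example, feeding a short name through the tokeniser and indexing each token as a separate value of the field might look like this (a sketch — the Field flags assume Lucene.Net 3.x and a document `doc` already in scope):

```csharp
var tokens = Tokeniser("Paul");
// tokens: Paul, Pau, ul, Pa, aul, au

// Index each substring as an untokenised value of the "name" field,
// so a plain TermQuery on "au" will match the document for "Paul"
foreach (var token in tokens)
    doc.Add(new Field("name", token, Field.Store.NO, Field.Index.NOT_ANALYZED));
```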

The search results do give the equivalent of a LIKE search. My "user" table will only ever be 9000 entities in size - so for me at least, this might fit my needs.

Are there any downsides to doing this (apart from a much larger Lucene index)?

Upvotes: 1

Views: 590

Answers (1)

Mikos

Reputation: 8553

Character-based n-gram tokenizers and filters (NGramTokenizer, NGramTokenFilter, EdgeNGramTokenizer and EdgeNGramTokenFilter) should provide the functionality you need.
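A sketch of wiring one of these up (assuming Lucene.Net 3.x with the Contrib.Analyzers package; the 2–10 gram sizes are illustrative, not a recommendation):

```csharp
// Hypothetical analyzer: emits every 2..10-character n-gram of each token,
// so "Paul" is indexed as pa, au, ul, pau, aul, paul
public class NGramAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // LowerCaseTokenizer splits on non-letters and lowercases;
        // NGramTokenFilter then expands each token into character n-grams
        return new NGramTokenFilter(new LowerCaseTokenizer(reader), 2, 10);
    }
}
```

This gives the hand-rolled tokeniser's behaviour for free, and a plain TermQuery against the n-grammed field then acts like a LIKE '%term%' match.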

Upvotes: 1
