Matching a large number of strings/phrases

Question

I need to implement a process in which a text file of roughly 50 to 150 KB is uploaded and matched against a large number of phrases (~10k).

I need to know which phrases match specifically.

A phrase could be "blah blah blah" or just "blah" - meaning I need to take word-boundaries into account, as I don't wish to include infix matches.

My first attempt was to create a large, pre-compiled list of regular expressions of the form @"\b{0}\b". As the 10k phrases are constant, I can cache and re-use this same list against multiple documents.
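For reference, the approach described above might look roughly like the sketch below. The class and method names (RegexPhraseMatcher, Build, Match) are hypothetical, and the Regex.Escape call is an addition that assumes the phrases are literal text rather than patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexPhraseMatcher
{
    // Build one compiled, word-bounded regex per phrase; do this once and cache the result.
    public static List<KeyValuePair<string, Regex>> Build(IEnumerable<string> phrases)
    {
        return phrases
            .Select(p => new KeyValuePair<string, Regex>(
                p,
                // Regex.Escape (an assumption) keeps characters such as '.' or '+' literal.
                new Regex(string.Format(@"\b{0}\b", Regex.Escape(p)), RegexOptions.Compiled)))
            .ToList();
    }

    // Every phrase triggers its own scan of the document text.
    public static List<string> Match(List<KeyValuePair<string, Regex>> compiled, string text)
    {
        return compiled.Where(kv => kv.Value.IsMatch(text)).Select(kv => kv.Key).ToList();
    }
}
```

With this shape the document is scanned once per phrase, so the work grows with the number of phrases times the document length.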

On my brand-new & very fast PC - this matching is taking 10 seconds+, which I would like to be able to reduce a great deal.

Any advice on how I may be able to achieve this would be greatly appreciated!

Cheers, Dave

Naz · Accepted Answer

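You could use Lucene.NET and its ShingleFilter, as long as you don't mind capping the number of words a phrase can have. A shingle is a run of consecutive word tokens; the filter emits every run up to a maximum length, which the analyzer below sets to 6.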

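// ShingleFilter ships with the Lucene.NET contrib analyzers.
public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize, lower-case, then emit shingles (word n-grams) of up to 6 words.
        var tokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        return new ShingleFilter(new LowerCaseFilter(tokenizer), 6);
    }
}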

You can run the analyzer using this utility method.

public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    // ITermAttribute is the Lucene.NET 3.0.x name; adjust for other versions.
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();

    // A HashSet drops duplicate terms on Add, so no Contains check is needed.
    var terms = new HashSet<string>();

    while (tokenStream.IncrementToken())
    {
        terms.Add(termAttribute.Term);
    }

    return terms;
}

Once you've retrieved all the terms, intersect them with your phrase list.

var matchingShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");

var matchingPhrases = phrasesToMatch.Intersect(matchingShingles, StringComparer.OrdinalIgnoreCase);

I think you will find this method is much faster than Regex matching, and it respects word boundaries.
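To make the idea concrete without a Lucene.NET dependency, here is a minimal sketch of the same shingle-and-intersect technique in plain C#. It is not the Lucene implementation: the ShingleMatcher class and its methods are hypothetical names, and the \w+ tokenizer is a simplifying assumption that will split text differently from StandardTokenizer (around apostrophes, for example).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ShingleMatcher
{
    // Split into lower-cased word tokens; \w+ is a simplification of StandardTokenizer.
    static string[] Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\w+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Emit every run of 1..maxWords consecutive tokens, joined by single spaces.
    public static HashSet<string> Shingles(string text, int maxWords)
    {
        var tokens = Tokenize(text);
        var result = new HashSet<string>();
        for (int start = 0; start < tokens.Length; start++)
        {
            int limit = Math.Min(maxWords, tokens.Length - start);
            for (int len = 1; len <= limit; len++)
            {
                result.Add(string.Join(" ", tokens, start, len));
            }
        }
        return result;
    }

    // Phrases go through the same tokenizer so both sides are normalised alike.
    public static List<string> Match(IEnumerable<string> phrases, string text, int maxWords)
    {
        var shingles = Shingles(text, maxWords);
        return phrases.Where(p => shingles.Contains(string.Join(" ", Tokenize(p)))).ToList();
    }
}
```

The document is tokenized once, and each phrase then costs a single hash lookup, so the work scales with the token count times maxWords rather than with the number of phrases. Because shingles are built from whole tokens, an infix such as "lah" never matches "blah". Phrases longer than maxWords words cannot match, which is the cap mentioned above.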
