Makah
Makah

Reputation: 4513

Fastest way to compare one String to a group of Wildcards

I have a Dictionary and my Key is a String with Wildcards. I want to know if one string match with any key in the dictinary.

Example:

String str = "Really Large String";
Dictionary dic = new Dictionary<String, MyClass>();
dic.Add("First+Match*", new MyClass());
dic.Add("*Large*", new MyClass());

Edit: I want to do something like:

foreach(var s in dic.Keys){
  if(str.Match(s))
    //Do Something
}

Upvotes: 4

Views: 794

Answers (3)

Jodrell
Jodrell

Reputation: 35716

Why not,

var dic = Dictionary<Regex, MyClass>()
dic.Add(new Regex("..."), new MyClass)
....

foreach(var match in dic.Keys.Where(k => k.IsMatch(str)))
{
    var myClass = dic[match];
    ....
}

Now a question, why use a dictionary at all, why not extend MyClass to match against a string itself, perhaps with a Predicate called Match.

var matchers = new HashSet<MyClass>();
matchers.Add(new MyClass("some regex?");
....

foreach(var match in matchers.Where(Match(str)))
{
    ....
}

EDIT

If you only want the first match then you could use FirstOrDefault instead of Where.

var firstMatch = matchers.FirstOrDefault(Match(str))
if (firstMatch != null)
{
    ....
}

However, this will make the order of the list significant.

EDIT 2

A partial implementation of MyClass to iclude a Match predicate could be ...

partial class MyClass
{
    private readonly RegEx matcher;

    public MyClass(string regEx)
    {
        matcher = new RegEx(regEx);
    }

    public bool Match(string value)
    {
        return matcher.IsMatch(value);
    }
}

Upvotes: 2

CMalcheski
CMalcheski

Reputation: 7

This explanation is going to be a bit longer than I'd like, but in a case like this it can be very useful to fully understand what's going on under the hood. Because a particular approach looks efficient in your source code does not mean the CPU sees it the same way. When you're concerned about byte-for-byte execution speed, you have to understand what's happening at the level where the operation will actually be performed. Anything above that level is just semantics and, ultimately, glorified macros that won't give you an accurate picture of what you're actually creating.

Intel / AMD CPU's have a set of repeat-scan instructions that allow you to set a pointer, put a byte into a register, set the number of bytes to scan, and internally the CPU blasts off and runs the scan as a single internal instruction, scanning byte for byte until a match or non-match is found (or the counter runs out). When the counter runs out, it can be a messy process to adjust your pointers and handle the "criteria not met; I ran out of counter!" condition; this won't affect your code directly but it could affect execution time if you're doing a lot of separate searches within a loop. For this reason it's never a bad idea to minimize the number of actual searches being done. It's not a huge factor but it could factor in.

What I do in my own code, in the case of large searches, is scan forward for the first byte. Matching that as a starting point for any further process saves the vast majority of time that would otherwise be wasted by comparing every byte. Let the CPU do it. There's an insane amount of work the CPU has to do, just to load an instruction and get ready to execute it so any time you can cut that work down, the program is going to run faster.

The problem here is that you don't have a lot of direct control over these things. Whatever language you're using is most likely going to use the slowest possible, brute-force approach: pick up a byte, look at it, pick up the next byte, yawn yawn yawn. If that's how your language is doing it (most do), then there isn't going to be much difference between any two choices for coding methodology: they're already all stuck in the performance basement. If you were in a situation where you could insert a block of assembly code, you'd be at an enormous benefit. But most people are forbidden from doing that because in 1985 it was considered illegal. Of course there are 32 and 64 bit issues to consider but those can be accounted for. Anyway, the bottom line is that if you can find out exactly what your particular language does, then take that into consideration. If it's doing the hunt-and-peck method of comparing strings, then the battle is already lost and most of what you do to tweak your code is probably not going to have much effect.

Upvotes: -1

Adriano Repetti
Adriano Repetti

Reputation: 67090

You can use RegEx, simply convert a string with wildcards to a regex pattern (I assume you want to use the pretty old standard "*" and "?" wildcards):

public static string ToRegEx(string pattern)
{
    return Regex.Escape(pattern).Replace("\\*", ".*").Replace("\\?", ".");
}

Upvotes: 2

Related Questions