ilitirit
ilitirit

Reputation: 16352

Fast string suffix checking in C# (.NET 4.0)?

What is the fastest method of checking string suffixes in C#?

I need to check each string in a large list (anywhere from 5000 to 100000 items) for a particular term. The term is guaranteed never to be embedded within the string. In other words, if the string contains the term, it will be at the end of the string. The string is also guaranteed to be longer than the suffix. Cultural information is not important.

These are how different methods performed against 100000 strings (half of them have the suffix):

 1.  Substring Comparison - 13.60ms
 2.  String.Contains - 22.33ms
 3.  CompareInfo.IsSuffix - 24.60ms
 4.  String.EndsWith - 29.08ms
 5.  String.LastIndexOf - 30.68ms

These are average times. [Edit] Forgot to mention that the strings also get put into separate lists, but this is not important. It does add to the running time though.

On my system substring comparison (extracting the end of the string using the String.Substring method and comparing it to the suffix term) is consistently the fastest when tested against 100000 strings. The problem with using substring comparison though is that Garbage Collection can slow it down considerably (more than the other methods) because String.Substring creates new strings. The effect is not as bad in .NET 4.0 as it was in 3.5 and below, but it is still noticeable. In my tests, String.Substring performed consistently slower on sets of 12000-13000 strings. This will obviously differ between systems and implementations.

[EDIT] Benchmark code: http://pastebin.com/smEtYNYN

[EDIT] FlyingStreudel's code runs fast, but Jon Skeet's recommendation of using EndsWith in conjunction with StringComparison.Ordinal appears to be the best option.

Upvotes: 6

Views: 7656

Answers (7)

user1088761
user1088761

Reputation: 1

There's a lot of good information here. I wanted to note that if your suffix is short, it could be even faster to look at the last few characters individually. My modified version of the benchmark code in question is here: http://pastebin.com/6nNdbEvW. It gives theses results:

  1. Last char equality: 1.52 ms (50000)
  2. Last 2 char equality: 1.56 ms (50000)
  3. EndsWith using StringComparison.Ordinal: 3.75 ms (50000)
  4. Contains: 11.10 ms (50000)
  5. LastIndexOf: 14.85 ms (50000)
  6. IsSuffix: 11.30 ms (50000)
  7. Substring compare: 17.69 ms (50000)

Upvotes: -1

FlyingStreudel
FlyingStreudel

Reputation: 4464

I dunno how fast this is, but this is what I would do?

static bool HasSuffix(string check, string suffix)
{
    int offset = check.Length - suffix.Length;
    for (int i = 0; i < suffix.Length; i++)
    {
        if (check[offset + i] != suffix[i])
        {
            return false;
        }
    }
    return true;
}

edit: OOPS x2

edit: So I wrote my own little benchmark... does this count? It runs 25 trials of evaluating one million strings and takes the average of the difference in performance. The handful of times I ran it it was consistently outputting that CharCompare was faster by ~10-40ms over one million records. So that is a hugely unimportant increase in efficiency (.000000001s/call) :) All in all I doubt it will matter which method you implement.

class Program
{
    volatile static List<string> strings;
    static double[] results = new double[25];
    static void Main(string[] args)
    {
        strings = new List<string>();
        Random r = new Random();
        for (int rep = 0; rep < 25; rep++)
        {
            Console.WriteLine("Run " + rep);
            strings.Clear();
            for (int i = 0; i < 1000000; i++)
            {
                string temp = "";
                for (int j = 0; j < r.Next(3, 101); j++)
                {
                    temp += Convert.ToChar(
                        Convert.ToInt32(
                        Math.Floor(26 * r.NextDouble() + 65)));
                }
                if (i % 4 == 0)
                {
                    temp += "abc";
                }
                strings.Add(temp);
            }
            OrdinalWorker ow = new OrdinalWorker(strings);
            CharWorker cw = new CharWorker(strings);
            if (rep % 2 == 0)
            {
                cw.Run();
                ow.Run();
            }
            else
            {
                ow.Run();
                cw.Run();                    
            }
            Thread.Sleep(1000);
            results[rep] = ow.finish.Subtract(cw.finish).Milliseconds;
        }
        double tDiff = 0;
        for (int i = 0; i < 25; i++)
        {
            tDiff += results[i];
        }
        double average = tDiff / 25;
        if (average < 0)
        {
            average = average * -1;
            Console.WriteLine("Char compare faster by {0}ms average", 
                average.ToString().Substring(0, 4));
        }
        else
        {
            Console.WriteLine("EndsWith faster by {0}ms average", 
                average.ToString().Substring(0, 4));
        }

    }
}   



class OrdinalWorker
{
    List<string> list;
    int count;
    public Thread t;
    public DateTime finish;
    public OrdinalWorker(List<string> l)
    {
        list = l;
    }

    public void Run()
    {
        t = new Thread(() => {
            string suffix = "abc";
            for (int i = 0; i < list.Count; i++)
            {
                count = (list[i].EndsWith(suffix, StringComparison.Ordinal)) ? 
                    count + 1 : count;
            }
            finish = DateTime.Now;
        });
        t.Start();
    }
}

class CharWorker 
{
    List<string> list;
    int count;
    public Thread t;
    public DateTime finish;
    public CharWorker(List<string> l)
    {
        list = l;
    }

    public void Run()
    {
        t = new Thread(() =>
        {
            string suffix = "abc";
            for (int i = 0; i < list.Count; i++)
            {
                count = (HasSuffix(list[i], suffix)) ? count + 1 : count;
            }
            finish = DateTime.Now;
        });
        t.Start();
    }

    static bool HasSuffix(string check, string suffix)
    {
        int offset = check.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (check[offset + i] != suffix[i])
            {
                return false;
            }
        }
        return true;
    }
}

Upvotes: 4

Grant Thomas
Grant Thomas

Reputation: 45083

I don't profess to be an expert in this area, however I felt compelled to at least profile this to some extent (knowing full well that my fictitious scenario will differ substantially from your own) and here is what I came up with:

It seems, at least for me, EndsWith takes the lead with LastIndexOf consistently coming in second, some timings are:

SubString:    00:00:00.0191877
Contains:     00:00:00.0201980
CompareInfo:  00:00:00.0255181
EndsWith:     00:00:00.0120296
LastIndexOf:  00:00:00.0133181

These were gleaned from processing 100,000 strings where the desired suffix appeared in all strings and so to me simply echoes Jon's answer (where the benefit is both speed and descriptiveness). And the code used to come to these results:

class Program
{
    class Profiler
    {
        private Stopwatch Stopwatch = new Stopwatch();

        public TimeSpan Elapsed { get { return Stopwatch.Elapsed; } }

        public void Start()
        {
            Reset();
            Stopwatch.Start();
        }

        public void Stop()
        {            
            Stopwatch.Stop();
        }

        public void Reset()
        {
            Stopwatch.Reset();
        }
    }

    static string suffix = "_sfx";
    static Profiler profiler = new Profiler();
    static List<string> input = new List<string>();
    static List<string> output = new List<string>();

    static void Main(string[] args)
    {
        GenerateSuffixedStrings();

        FindStringsWithSuffix_UsingSubString(input, suffix);
        Console.WriteLine("SubString:    {0}", profiler.Elapsed);

        FindStringsWithSuffix_UsingContains(input, suffix);
        Console.WriteLine("Contains:     {0}", profiler.Elapsed);

        FindStringsWithSuffix_UsingCompareInfo(input, suffix);
        Console.WriteLine("CompareInfo:  {0}", profiler.Elapsed);

        FindStringsWithSuffix_UsingEndsWith(input, suffix);
        Console.WriteLine("EndsWith:     {0}", profiler.Elapsed);

        FindStringsWithSuffix_UsingLastIndexOf(input, suffix);
        Console.WriteLine("LastIndexOf:  {0}", profiler.Elapsed);

        Console.WriteLine();
        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }

    static void GenerateSuffixedStrings()
    {
        for (var i = 0; i < 100000; i++)
        {
            input.Add(Guid.NewGuid().ToString() + suffix);
        }
    }

    static void FindStringsWithSuffix_UsingSubString(IEnumerable<string> strings, string suffix)
    {
        output.Clear();
        profiler.Start();
        foreach (var s in strings)
        {
            if(s.Substring(s.Length - 4) == suffix)
                output.Add(s);
        }
        profiler.Stop();
    }

    static void FindStringsWithSuffix_UsingContains(IEnumerable<string> strings, string suffix)
    {
        output.Clear();
        profiler.Start();
        foreach (var s in strings)
        {
            if (s.Contains(suffix))
                output.Add(s);
        }
        profiler.Stop();
    }

    static void FindStringsWithSuffix_UsingCompareInfo(IEnumerable<string> strings, string suffix)
    {
        var ci = CompareInfo.GetCompareInfo("en-GB");
        output.Clear();
        profiler.Start();
        foreach (var s in strings)
        {
            if (ci.IsSuffix(s, suffix))
                output.Add(s);
        }
        profiler.Stop();
    }

    static void FindStringsWithSuffix_UsingEndsWith(IEnumerable<string> strings, string suffix)
    {
        output.Clear();
        profiler.Start();
        foreach (var s in strings)
        {
            if (s.EndsWith(suffix))
                output.Add(s);
        }
        profiler.Stop();
    }

    static void FindStringsWithSuffix_UsingLastIndexOf(IEnumerable<string> strings, string suffix)
    {
        output.Clear();
        profiler.Start();
        foreach (var s in strings)
        {
            if (s.LastIndexOf(suffix) == s.Length - 4)
                output.Add(s);
        }
        profiler.Stop();
    }
}

EDIT:

As commented, I attempted this again with only some of the strings having a suffix applied and these are the results:

SubString:    00:00:00.0079731
Contains:     00:00:00.0243696
CompareInfo:  00:00:00.0334056
EndsWith:     00:00:00.0196668
LastIndexOf:  00:00:00.0229599

The string generator method was updated as follows, to produce the strings:

    static void GenerateSuffixedStrings()
    {
        var nxt = false;
        var rnd = new Random();            
        for (var i = 0; i < 100000; i++)
        {
            input.Add(Guid.NewGuid().ToString() + 
                (rnd.Next(0, 2) == 0 ? suffix : string.Empty));
        }
    }

Further, this trend continues if none of the string have a suffix:

SubString:    00:00:00.0055584
Contains:     00:00:00.0187089
CompareInfo:  00:00:00.0228983
EndsWith:     00:00:00.0114227
LastIndexOf:  00:00:00.0199328

However, this gap shortens again when assigning a quarter of the inputs a suffix (the first quarter, then sorting to randomise the coverage):

SubString:    00:00:00.0302997
Contains:     00:00:00.0305685
CompareInfo:  00:00:00.0306335
EndsWith:     00:00:00.0351229
LastIndexOf:  00:00:00.0322899

Conclusion? IMO, and agreeing with Jon, EndsWith seems the way to go (based on this limited test, anyway).

Further Edit:

To cure Jon's curiosity I ran a few more tests on EndsWith, with and without Ordinal string comparison...

On 100,000 strings with a quarter of them suffixed:

EndsWith:            00:00:00.0795617
OrdinalEndsWith:     00:00:00.0240631

On 1,000,000 strings with a quarter of them suffixed:

EndsWith:            00:00:00.5460591
OrdinalEndsWith:     00:00:00.2807860

On 10,000,000 strings with a quarter of them suffixed:

EndsWith:            00:00:07.5889581
OrdinalEndsWith:     00:00:03.3248628

Note that I only ran the last test once as generating the strings proved this laptop is in need of a replacement

Upvotes: 0

Eric Lippert
Eric Lippert

Reputation: 660138

I took a look at your benchmark code and frankly, it looks dodgy.

You are measuring all kinds of extraneous things along with what it is you want to measure; you're measuring the cost of the foreach and the adding to a list, both of which might have costs of the same order of magnitude as the thing you are attempting to test.

Also, you are not throwing out the first run; remember, the JIT compiler is going to jit the code that you call the first time through the loop, and it is going to be hot and ready to go the second time, so your results will therefore be skewed; you are averaging one potentially very large thing with many small things. In the past when I have done this I have discovered situations where the jit time actually dominated the time of everything else. Is that realistic? Do you mean to measure the jit time, or should it be not considered as part of the average?

Upvotes: 6

Jon Skeet
Jon Skeet

Reputation: 1500785

If that's the time taken to check 100,000 strings, does it really matter?

Personally I'd use string.EndsWith on the grounds that it's the most descriptive: it says exactly what you're trying to test.

I'm somewhat suspicious of the fact that it appears to be performing worst though... if you could post your benchmark code, that would be very useful. (In particular, it really shouldn't have to do as much work as string.Contains.)

Have you tried specifying an ordinal match? That may well make it significantly faster:

if (x.EndsWith(y, StringComparison.Ordinal))

Of course, you shouldn't do that unless you want an ordinal comparison - are you expecting culturally-sensitive matches? (Developers tend not to consider this sort of thing, and I very firmly include myself in that category.)

Upvotes: 19

Eric Lippert
Eric Lippert

Reputation: 660138

Jon is absolutely right; this is potentially not an apples-to-apples comparison because different string methods have different defaults for culteral sensitivity. Be very sure that you are getting the comparison semantics you intend to in each one.

In addition to Jon's answer, I'd add that the relevant question is not "which is fastest?" but rather "which is too slow?" What's your performance goal for this code? The slowest method still finds the result in less time than it takes a movie projector to advance to the next frame, and obviously that is not noticable by humans. If your goal is that the search appears instantaneous to the user then you're done; any of those methods work. If your goal is that the search take less than a millisecond then none of those methods work; they are all orders of magnitude too slow. What's the budget?

Upvotes: 14

ykatchou
ykatchou

Reputation: 3727

Did you try direct access ? I mean, you can make a loop watching for similar string, it could be faster than make a substring and having the same behaviour.

int i,j;
foreach(String testing in lists){
   i=0;
   j=0;
   int ok=1;
   while(ok){
      i = testing.lenght - PATTERN.lenght;
      if(i>0 && i<testing.lenght && testing[i] != PATTERN[j])
        ok = 0;
      i++;
      j++;
   }
   if(ok) return testing;
}

Moreover if it's big strings, you could try using hashs.

Upvotes: 0

Related Questions