EOG
EOG

Reputation: 1747

Fast way to make sure string does not contain specific characters

I want to make sure that C# string does not contain specific characters.

I am using string.IndexOfAny(char[]), in my opinion Regex would be slower in this task. Is there a better way to accomplish this? Speed is critical in my application.

Upvotes: 6

Views: 2165

Answers (4)

Tim Schmelter
Tim Schmelter

Reputation: 460158

You could use this concise and efficient LINQ query:

HashSet<char> specificChars = new HashSet<char>{ 'a', 'b', 'c'};
bool notContainsSpecificChars = !"test".Any(specificChars.Contains); // true

I have used a HashSet<char> since it is efficient for lookups, duplicates aren't allowed.

if you have an array as input you could use the constructor to create a HashSet from it:

char[] chars = new[] { 'a', 'b', 'c', 'c' };
specificChars = new HashSet<char>(chars); // c is removed since it was a duplicate

Another approach without HashSet is using Enumerable.Intersect + Enumerable.Any:

bool notContainsSpecificChars = !"test".Intersect(chars).Any();

Upvotes: 2

Kami
Kami

Reputation: 19407

Ran a quick benchmark on IndexOf vs IndexOfAny vs Regex vs Hashset.
500 word lorem ipsum haystack, with two character needle.
Tested with both needles in haystack, one in haystack, neither in haystack.

    private long TestIndexOf(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int x = haystack.IndexOfAny(needles);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    private long TestRegex(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        Regex regex = new Regex(string.Join("|", needles));
        for (int i = 0; i < 1000000; i++)
        {
            Match m = regex.Match(haystack);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    private long TestIndexOf(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int x = haystack.IndexOf(needles[0]);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    private long TestHashset(string haystack, char[] needles)
    {
        HashSet<char> specificChars = new HashSet<char>(needles.ToList());
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            bool notContainsSpecificChars = !haystack.Any(specificChars.Contains);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

Result for 1,000,000 iterations:

Index of : 28/2718/2711
Index of Any : 153/141/17561
Regex of : 1068/1102/92324
Hashset : 939/891/111702

Notes:

  • Smaller haystack improves performance.
  • Larger needle set increases regex performance.
  • Larger needle set reduces indexofany performance.
  • If needle is not in haystack all methods degrade in performance

Overall, regex is slower then indexofany by upto 10x depending on the haystack and needle sizes.

Upvotes: 6

Steve Ruble
Steve Ruble

Reputation: 3895

String.IndexOfAny(char[]) is implemented in the CLR itself and String.IndexOf uses an extern call, so they're both pretty fast. Either will be much faster than using a regex.

Whether IndexOf is better than IndexOfAny depends on how many characters you're expecting to be checking for. Based on some really rough benchmarking, it looks like IndexOf performs better (by a tiny amount) for 2 or fewer characters, but IndexOfAny is better for 3 or more. The difference is minuscule, though - the advantage of using IndexOfAny may be overwhelmed by the cost of allocating the array of characters.

Upvotes: 0

stefano m
stefano m

Reputation: 4234

if you must find one char only, it woulb be better call method IndexOf(singleChar) or IndexOf(singleChar, startIndex, charCount) instead.

Of cuorse Regex would be much more computational expensive!

Upvotes: 0

Related Questions