user18860244

Faster method to remove non-letter characters from a string

I want to remove all characters from a string, except Unicode letters.

I am considering this code:

public static string OnlyLetters(string text)
{
    return new string (text.Where(c => Char.IsLetter(c)).ToArray());
}

But maybe Regex will be faster?

public static string OnlyLetters(string text)
{
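    // Verbatim string (@) so that \p is passed to the regex engine
    // instead of being rejected as an unrecognized C# escape sequence
    Regex rgx = new Regex(@"[^\p{L}]");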
    return rgx.Replace(text, "");
}

Could you verify this code and suggest which one I should choose?

Upvotes: 0

Views: 315

Answers (1)

Dmitrii Bychenko

Reputation: 186708

If you want to know which horse is faster, race them.

Hand-written loops often turn out to be fast, so let's add one more horse to the field:

private static string ManualReplace(string value)
{
  // Let's allocate memory only once - value.Length characters
  StringBuilder sb = new StringBuilder(value.Length);

  foreach (char c in value)
    if (char.IsLetter(c))
      sb.Append(c);

  return sb.ToString();
}
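
Before racing, it is worth checking that the manual loop returns the same string as the LINQ version from the question. A minimal sketch, assuming a hypothetical class name ManualReplaceDemo and a made-up sample input:

```csharp
using System;
using System.Linq;
using System.Text;

public static class ManualReplaceDemo
{
    // Same loop as above, repeated so that the sketch compiles on its own.
    public static string ManualReplace(string value)
    {
        StringBuilder sb = new StringBuilder(value.Length);

        foreach (char c in value)
            if (char.IsLetter(c))
                sb.Append(c);

        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample: ASCII letters, one non-ASCII letter (\u00F6 is 'ö'),
        // digits, spaces and punctuation.
        string sample = "Hello, w\u00F6rld 123!";

        string manual = ManualReplace(sample);
        string linq = new string(sample.Where(c => char.IsLetter(c)).ToArray());

        Console.WriteLine(manual == "Hellow\u00F6rld"); // True
        Console.WriteLine(manual == linq);              // True
    }
}
```

Comparing against an escaped literal rather than printing the string keeps the check independent of the console encoding.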

Races:

// 123 - seed - in order for the text to be the same
Random random = new Random(123);

// Let's compile the regex
Regex rgx = new Regex(@"[^\p{L}]", RegexOptions.Compiled);
string result = null; // <- makes the compiler happy

string text = string.Concat(Enumerable
                            .Range(1, 10_000_000)
                            .Select(_ => (char)random.Next(32, 128)));

Stopwatch sw = new Stopwatch();

// Warming: let .NET compile CIL, fill caches, allocate memory, etc.
int warming = 5;

for (int i = 0; i < warming; ++i)
{
  if (i == warming - 1)
    sw.Start();

  // result = new string(text.Where(c => char.IsLetter(c)).ToArray());

  result = rgx.Replace(text, "");

  // result = string.Concat(text.Where(c => char.IsLetter(c)));

  // result = ManualReplace(text);

  if (i == warming - 1)
    sw.Stop();
}

Console.WriteLine($"{sw.ElapsedMilliseconds}");
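
The snippet above times one candidate per run; the other three are commented out. The following is a sketch of the same race with all four candidates in a single run, not the code that produced the timings in this answer. The class name Races and the helper Measure are hypothetical, and the text is shortened to 1,000,000 characters (an arbitrary choice) so that it finishes quickly. It also checks that every candidate returns the same string.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class Races
{
    private static readonly Regex Rgx = new Regex(@"[^\p{L}]", RegexOptions.Compiled);

    public static string ManualReplace(string value)
    {
        StringBuilder sb = new StringBuilder(value.Length);

        foreach (char c in value)
            if (char.IsLetter(c))
                sb.Append(c);

        return sb.ToString();
    }

    // Hypothetical helper: runs the candidate `warming` times
    // and times only the last run, as in the loop above.
    public static (string Result, long Milliseconds) Measure(
        Func<string, string> candidate, string text, int warming = 5)
    {
        string result = null;
        Stopwatch sw = new Stopwatch();

        for (int i = 0; i < warming; ++i)
        {
            if (i == warming - 1)
                sw.Start();

            result = candidate(text);

            if (i == warming - 1)
                sw.Stop();
        }

        return (result, sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        // Fixed seed, so every run works on the same text
        Random random = new Random(123);

        string text = string.Concat(Enumerable
            .Range(1, 1_000_000)
            .Select(_ => (char)random.Next(32, 128)));

        var candidates = new (string Name, Func<string, string> Run)[]
        {
            ("new string", t => new string(t.Where(c => char.IsLetter(c)).ToArray())),
            ("rgx.Replace", t => Rgx.Replace(t, "")),
            ("string.Concat", t => string.Concat(t.Where(c => char.IsLetter(c)))),
            ("Manual", ManualReplace),
        };

        // Reference output used to confirm that all candidates agree
        string expected = ManualReplace(text);

        foreach (var (name, run) in candidates)
        {
            var (result, ms) = Measure(run, text);
            Console.WriteLine($"{name,-14}: {ms,4} ms, same output: {result == expected}");
        }
    }
}
```

Timings from a single process can be skewed by whichever candidate runs first and by garbage collection, so treat the numbers as a rough ranking only.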

Run this several times for each candidate (uncomment one line at a time) and compare the timings. Mine (.NET 6, Release build) are:

new string    : 120 ms
rgx.Replace   : 350 ms
string.Concat : 150 ms
Manual        :  80 ms

So we have a winner: the manual loop. Among the others, new string(text.Where(c => char.IsLetter(c)).ToArray()) is the fastest, string.Concat is slightly slower, and Regex.Replace comes last, more than four times slower than the manual loop in this run.

Upvotes: 4
