user9993
user9993

Reputation: 6170

Why does my object take a long time to be created?

I'm writing code that scans large sections of text and performs some basic statistics on it, such as number of upper and lower case characters, punctuation characters etc.

Originally my code looked like this:

    foreach (var character in stringToCount)
    {
        if (char.IsControl(character))
        {
            controlCount++;
        }
        if (char.IsDigit(character))
        {
            digitCount++;
        }
        if (char.IsLetter(character))
        {
            letterCount++;
        } //etc.
    }

And then from there I was creating a new object like this, which simply reads the local variables and passes them to the constructor:

    var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
        symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
        lowercaseCount, upperCaseCount, tempDictionary);

However a user over on Code Review Stack Exchange pointed out that I can just do the following. Great, I've saved myself a load of code which is good.

    var result = new CharacterCountResult(stringToCount.Count(char.IsControl),
        stringToCount.Count(char.IsHighSurrogate), stringToCount.Count(char.IsLowSurrogate),
        stringToCount.Count(char.IsWhiteSpace), stringToCount.Count(char.IsSymbol),
        stringToCount.Count(char.IsPunctuation), stringToCount.Count(char.IsSeparator),
        stringToCount.Count(char.IsLetter), stringToCount.Count(char.IsDigit),
        stringToCount.Count(char.IsNumber), stringToCount.Count(char.IsLetterOrDigit),
        stringToCount.Count(char.IsLower), stringToCount.Count(char.IsUpper), tempDictionary);

However creating the object the second way takes approximately (on my machine) an extra ~200ms.

How can this be? While it might not seem a significant amount of extra time, it soon adds up when I've left it running processing text.

What should I be doing differently?

Upvotes: 3

Views: 462

Answers (2)

mrjoltcola
mrjoltcola

Reputation: 20842

You are using method groups (syntactic sugar hiding a lambda or delegate) and iterating over the characters many times, whereas you could get it done with one pass (as in your original code).

I remember your previous question, and I recall seeing the recommendation to use the method group and string.Count(char.IsLetterOrDigit) and thinking "yeh that looks pretty but won't perform well", so it was amusing to actually see that you found exactly that.

If performance is important, I would just do it without delegates period, and use one giant loop with a single pass, the traditional way without delegates or multiple iterations, and even further, tune it by organizing the logic such that any case that excludes other cases is organized such that you do "lazy evaluation". Example, if you know a character is whitespace, then don't check for digit or alpha, etc. Or if you know it is digitOrAlpha, then include digit and alpha checks inside that condition.

Something like:

foreach(var ch in string) {
   if(char.IsWhiteSpace(ch)) {
      ...
   }
   else {
      if(char.IsLetterOrDigit(ch)) {
         letterOrDigit++;
         if(char.IsDigit(ch)) digit++;
         if(char.IsLetter(ch)) letter++;
      }  
   }
}

If you REALLY want to micro-optimize, write a program to pre-calculate all of the options and emit a huge switch statement which does table lookups.

switch(ch) {
   case 'A':
        isLetter++;
        isUpper++;
        isLetterOrDigit++;
        break;
   case 'a':
        isLetter++;
        isLower++;
        isLetterOrDigit++;
        break;
   case '!':
        isPunctuation++;

   ...
}

Now if you want to get REALLY crazy, organize the switch statement according to real-life frequency of occurence, and put the most common letters at the top of the "tree", and so forth. Of course, if you care that much about speed, it might be a job for plain C.

But I've wandered a bit far afield from your original question. :)

Upvotes: 4

Scott Chamberlain
Scott Chamberlain

Reputation: 127563

Your old way you walked through the text once, increasing all of your counters as you go. In your new way you walk though the text 13 times (once for each call to stringToCount.Count() and only update one counter per pass.

However, this kind of problem is the perfect situation for Parallel.ForEach. You can walk through the text with multiple threads (being sure your increments are thread safe) and get your totals faster.

Parallel.ForEach(stringToCount, character =>
{
    if (char.IsControl(character))
    {
        //Interlocked.Increment gives you a thread safe ++
        Interlocked.Increment(ref controlCount);
    }
    if (char.IsDigit(character))
    {
        Interlocked.Increment(ref digitCount);
    }
    if (char.IsLetter(character))
    {
        Interlocked.Increment(ref letterCount);
    } //etc.
});

var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
    symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
    lowercaseCount, upperCaseCount, tempDictionary);

It still walks through the text once, but many workers will be walking through various parts of the text at the same time.

Upvotes: 3

Related Questions