Samresh
Samresh

Reputation: 58

performance while doing string concatenation - algorithm string strings c#

I was using the following code to append strings

string res = string.Empty;
int ite = 100000;
for(int i= 0; i < ite; i++)
{
    res += "5";
}

It was taking a lot of time, so I later changed the code to

string res = string.Empty;
int ite = 100000;
res = getStr(ite / 2) + getStr(ite - (ite / 2)); 

//body of getStr method
private static string getStr(int p)
{
    if (p == 1)
        return "5";
    else if (p == 0)
        return string.Empty;
    string r1 = getStr(p / 2); //recursive
    string r2 = getStr(p - (p / 2)); //recursive  
    return (r1 + r2);
}

which in my opinion does nothing actually as the number of times strings are concatenated is approximately the same as in previous approach.

But using this approach there was significant improvement in performance as the code which was taking around 2500 ms (on my machine) was now taking 10 ms.

I ran a profiler over cpu time and couldn't arrive at an understanding as to why there is an improvement in performance. Can anyone please explain this.

Note: I am intentionally not using StringBuilder, in order to understand the above.

Upvotes: 4

Views: 993

Answers (2)

PJTraill
PJTraill

Reputation: 1420

Assuming - as per Matt Burland's answer - that the time cost of creating a string of length n by one of the given algorithms is dominated by the number of characters copied, the observed timings may be largely explained by calculating this for both algorithms. This yields O(n2) and O(n log n) and, for length 10,000, a ratio of 348:1. The algorithm may be improved to O(n) in Java, but apparently not in .NET.

Cost of the improved algorithm

Inspection of the improved algorithm shows that the number c(n) of characters copied obeys the following recurrence relation:

c(0) = 0
c(1) = 1
c(n) = c(⌊n/2⌋) + c(⌈n/2⌉) + n

This may be solved to yield

c(2k + a) = (k + 1)2k + (k + 3)a

where k and a are chosen so that n = 2k + a , a < 2k ; this is readily verified by complete induction. This is O(k 2k), i.e. O(n log2n), i.e. O(n log n),

Explanation: Comparison of costs

The original algorithm clearly copies n(n+1)/2 characters, and is thus O(n2).

The revised algorithm clearly copies far fewer characters; for the given 10,000 character string:

c(10000) =
c(213 + 1808) =
(13+1) * 8192 + 16 * 1808 =
143,616

while the original algorithm copies 50,005,000 characters, a ratio of approximately 1 : 348, consistent to well within an order of magnitude with the observed ratio of 1:250. The imperfect match does suggest that other factors such as memory management may be significant.

Further optimisation

Given that the string is filled with a single character, and assuming that String.Substring does not make a copy of the string, which is true in Java but not .NET according to comparison-of-substring-operation-performance-between-net-and-java , we can improve on the second algorithm (without using StringBuilder or String('5', ite)) by continually doubling the constructed string, adding an extra character when necessary:

private static string getStr(int p)
{
    if(p == 0)
        return "";
    if(p == 1)
        return "5";
    string s = getStr ((p+1) / 2);
    if( s.Length + s.Length == p )
        return s + s;
    else
        return s + s.Substring(0, p - s.Length);
}

For the number of characters c2(n) copied by this algorithm we have

c2(n) = n + c2(⌈n/2⌉)

from which we may derive

c2(n) = 2_n_ + d(n)

where d(n) is -1 if n is a power of 2 and is otherwise the number of “internal” (i.e. neither leading nor trailing) digits equal to 0 in the binary expansion of m; equivalently, d(n) is defined by the first matching case for m ∈ ℕ in:

d(2m) = -1
d(2 m) = d(m)
d(m) = the number of essential (non-leading) 0 binary digits in m

The expression for c2 may also be verified by complete induction and is O(n + log n), i.e. O(n).

It is fairly straightforward to remove the recursion from this algorithm.

In the OP’s case this algorithm copies c2(10,000) = 20,000 + d(110000110101000002) = 20,006 characters and would thus appear to be some 7 times faster.

Other remarks

  • This analysis applies to creating a multiple of any string, not just "5".
  • The most efficient way of constructing the OP’s string is presumably String('5', ite).
  • If using a StringBuilder to build a string of known size, one may reduce allocations by using StringBuilder(capacity).
  • This analysis applies to other environments than .NET.
  • In C, allocate a buffer of the right size (including '\0'!), copy in the string to be repeated and then repeatedly append a copy of the filled part of the buffer until it is full.

Upvotes: 1

Matt Burland
Matt Burland

Reputation: 45135

You need to think about why string concatenation is slow. Strings are immutable, so when you do:

someString+= "5";

You have to copy the entire contents of someString and to another string that is one larger and then copy in the 5 part. If you think about it, this gets slower and slower the longer the string gets.

With your recursive function you are doing a divide and conquer strategy to help to minimize the number of big string concatenations you need. For example, if you had a length of 8, in the first case you'd be doing:

"5" + "5" 
"55" + "5"
"555" + "5"
"5555" + "5"
"55555" + "5"
"555555" + "5"
"5555555" + "5"    // 7 total concatenations

In your recursive model you are doing:

"5" + "5"         // Four times
"55" + "55"       // twice
"5555" + "5555"   // once 

So you are doing less big concatenations.

And, of course, I assume the OP knows this from their comment, but for anybody else; if you need to concatenate any non-trivial number of strings, use StringBuilder as it is optimized for building strings by Appending them together.

Upvotes: 15

Related Questions