Reputation: 191
So we are told that StringBuilder should be used when you are doing more than a few operations on a string (I've heard as low as three). Therefore we should replace this:
string s = "";
foreach (var item in items) // where items is IEnumerable<string>
s += item;
With this:
string s = new StringBuilder(items).ToString();
I assume that internally StringBuilder holds references to each Appended string, combining then on request. Lets compare this to the HybridDictionary, that uses a LinkedList for the first 10 elements, then swaps to a HashTable when the list grows more then 10. As we can see the same kind of pattern is here, small number of references = linkedList, else make ever increasing blocks of arrays.
Lets look at how a List works. Start off with a list size (internal default is 4). Add elements to the internal array, if the array is full, make a new array of double the size of the current array, copy the current array's elements across, then add the new element and make the new array the current array.
Can you see my confusion as to the performance benefits? For all elements besides strings, we make new arrays, copy old values and add the new value. But for strings that's bad? because we know that "a" + "b" makes a new string reference from the two old references, "a" and "b".
Hope my question isn't too confusing. Why does there seem to be a double standard between string concatenation and array concatenation (I know strings are arrays of chars)?
String: Making new references is bad!
T : where T != String: Making new references is good!
Edit: Maybe what I'm really asking here, is when does making new, bigger arrays and copying the old values across, start being faster than have references to randomly places objects all over the heap?
Double edit: By faster I mean reading, writing and finding variables, not inserting or removing (i.e. LinkedList would kickass at inserting for example, but I don't care about that).
Final edit: I don't care about StringBuilder, I'm interested in the trade off in time taken to copy data from one part of the heap to another for cache alignments, vs just taking the cache misses from teh cpu and have references all over the heap. When does one become faster then the other?*
Upvotes: 2
Views: 2203
Reputation: 74287
The backing store of a StringBuilder
is a char[]
that gets resized as needed. Nothing is turned into a string until you invoke StringBuilder.ToString()
on it.
The backing store of List<T>
is a T[]
that gets resized as needed.
The problem with something like
string s = a + b + c + d ;
is that the compiler parses it as
+
/ \
a +
/ \
b +
/ \
c d
and, unless it can see opportunities for optimization, do something like
string t1 = c + d ;
string t2 = b + t1 ;
string s = a + t2 ;
thus creating two temporaries and the final string. With a StringBuilder
, though, it's going to build out the character array it needs and at the end create one string.
This is a win because strings, once created, are immutable (can't be changed) and are generally interned in the string pool (meaning that there is only ever one instance of the string...no matter how many time your create the string "abc"
, every instance will always be a reference to the same object in the string pool.
This adds cost to string creation as well: having determined the candidate string, the runtime has to check the string pool to see if it already exists. If it does, that reference is used; if it does not the candidate string is added to the string pool.
Your example, though:
string s = "a" + "b" + "c" + "d" ;
is a non-sequitur: the compile sees the constant expression and does an optimization called constant folding, so it becomes (even in debug mode):
string s = "abcd" ;
Similar optimizations happen with arithmetic expressions:
int x = 12 / 3 ;
is going to be optimized away to
int x = 4 ;
Upvotes: 0
Reputation: 34198
It seems to me that you are conflating the resizing of a list with the evaluation of a string expression, and assuming that the two should behave the same way.
Consider your example: string s = "a" + "b" + "c" + "d"
Assuming no optimisations of the constant expression (which the compiler would handle automatically), what this will do is evaluate each operation in turn:
string s = (("a" + "b") + "c") + "d"
This results in the strings "ab"
and "abc"
being created as part of that single expression. This has to happen, because strings
in.NET are immutable, which means their values cannot be changed once created. This is because, if strings were mutable, you'd have code like this:
string a = "hello";
string b = a; // would assign b the same reference as a
string b += "world"; // would update the string it references
// now a == "helloworld"
If this were a List
, the code would make more sense, and doesn't even need explanation:
var a = new List<int> { 1, 2, 3 };
var b = a;
b.Add(4);
// now a == { 1, 2, 3, 4 }
So the reason that non-string "list" types allocate extra memory early is for reasons of efficiency, and to reduce allocations when the list is extended. The reason that a string
does not do that is because a string
's value is never updated once created.
Your assumption about the operation of the StringBuilder
is irrelevant, but the purpose of a StringBuilder
is essentially to create a non-immutable object that reduces the overhead of multiple string
operations.
Upvotes: 1
Reputation: 203835
Therefore we should replace this:
No you shouldn't. The first case you showed string concatenation that can take place at compile time and have replaced it with string concatenation that takes place a runtime. The former is much more desirable, and will execute faster than the latter.
It's important to use a string builder when the number of strings being concatted is not known at compile time. Often (but not always) this means concatting strings in a loop.
Earlier versions of String Builder (before 4.0, if memory serves), did internally look more or less like a List<char>
, and it's correct that post 4.0 it looks more like a LinkedList<char[]>
. However, the key difference here between using a StringBuilder
and using regular string concatenation in a loop is not the difference between a linked list style in which objects contain references to the next object in the "chain" and an array-based style in which an internal buffer overallocates space and is reallocated occasionally as needed, but rather the difference between a mutable object and an immutable object. The problem with traditional string concatenation is that, since strings are immutable, each concatenation must copy all of the memory from both strings into a new string. When using a StringBuilder
the new string only needs to be copied onto the end of some type of data structure, leaving all of the existing memory as it is. What type of data structure that is isn't terribly important here; we can rely on Microsoft to use a structure/algorithm that has been proven to have the best performance characteristics for the most common situations.
Upvotes: 4