Reputation: 1964
Given that RAM is much faster than a hard drive, I was surprised by the code below.
I was trying to split a CSV file based on the value of one column, writing the lines for each distinct value in that column to a separate file.
My first attempt was:
List<string> protocolTypes = new List<string>();
List<string> splitByProtocol = new List<string>();
// (both lists are populated elsewhere, one entry per protocol)
foreach (string s in lineSplit)
{
    string protocol = getProtocol();
    int index = protocolTypes.IndexOf(protocol);
    splitByProtocol[index] = splitByProtocol[index] + s + "\n";
}
This took ages, but changing it to use a StreamWriter was much faster:
List<string> protocolTypes = new List<string>();
List<StreamWriter> splitByProtocol = new List<StreamWriter>();
// (lists populated elsewhere, one open writer per protocol)
foreach (string s in lineSplit)
{
    string protocol = getProtocol();
    int index = protocolTypes.IndexOf(protocol);
    splitByProtocol[index].WriteLine(s);
}
Why is writing to disk so much faster than appending strings together in memory? I know that appending to a string requires copying the whole string to a new memory location, but string appending was orders of magnitude slower than writing to disk, which seems counterintuitive.
Upvotes: 2
Views: 2064
Reputation: 10516
First, the runtime allocates (a lot of) memory for the new string. Then it copies over the existing string, plus the part being appended, byte for byte. This takes quite a few cycles, and on every iteration the string gets longer, so the overall running time grows quadratically with the number of iterations, not linearly.
Garbage collection makes this worse: each accumulated string that survives a collection gets promoted (Gen 0 to Gen 1 to Gen 2) and is copied yet again, while the heap fills up with all the abandoned intermediate strings waiting to be collected. This approach creates quite some overhead for the GC.
For the disk it's only writing to the stream: the data first lands in the writer's in-memory buffer (fast), then the OS disk cache (fast), until it is finally written to disk (slow, but buffered, so it looks very fast). Each line is also written only once, so performance is roughly linear in the number of iterations.
BTW, you might want to look into StringBuilder for the in-memory version; that will probably be even faster.
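To illustrate the quadratic-versus-linear behaviour, here is a minimal, self-contained sketch (the line count, line contents, and method names are my own, not from the question):

```csharp
using System;
using System.Diagnostics;
using System.Text;

class ConcatDemo
{
    // Quadratic: every += allocates a new string and copies
    // the whole accumulated text into it.
    public static string Concat(int lines, string line)
    {
        string s = "";
        for (int i = 0; i < lines; i++)
            s = s + line + "\n";
        return s;
    }

    // Roughly linear: StringBuilder appends into a growable buffer
    // and only materializes one final string at the end.
    public static string Build(int lines, string line)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < lines; i++)
            sb.Append(line).Append('\n');
        return sb.ToString();
    }

    static void Main()
    {
        string line = new string('x', 40);

        var sw = Stopwatch.StartNew();
        Concat(10_000, line);
        Console.WriteLine($"string +=     : {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        Build(10_000, line);
        Console.WriteLine($"StringBuilder : {sw.ElapsedMilliseconds} ms");
    }
}
```

Both methods produce identical output; only the allocation pattern differs, which is where the time goes.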
Upvotes: 2
Reputation: 24136
If the strings become huge (many MB) then copying them definitely becomes time-consuming.
However the biggest hit may be caused by the many old strings that are no longer needed, sitting as garbage on the heap, waiting to be collected. So the garbage collector will kick in, possibly even many times, pausing your program every time.
For strings constructed in a loop like this, always consider using StringBuilder instead. To match your example code:
List<string> protocolTypes = new List<string>();
List<StringBuilder> splitByProtocol = new List<StringBuilder>();
foreach (string s in lineSplit)
{
    string protocol = getProtocol();
    int index = protocolTypes.IndexOf(protocol);
    splitByProtocol[index].AppendLine(s);
}
Upvotes: 4
Reputation: 156918
First make sure your measurements are okay.
If they are, here is the explanation: the StreamWriter writes into a buffer, while the string version creates a brand-new string on every append, which ends up in excessive memory allocations while the stream writer is still just caching. Note that you aren't flushing, which means the file is not necessarily written until the stream is flushed (which your code never forces), so you may simply be storing the data in a much more efficient in-memory buffer than your string appending uses. And even when it does flush, the whole buffer goes out at once; with a fast disk you end up faster than the overly expensive string concatenation.
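As a small illustration of the buffering point, here is a sketch (the file name and temp-directory location are made up for the demo) showing that a StreamWriter holds data in memory until it is flushed or disposed:

```csharp
using System;
using System.IO;

class FlushDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "flushdemo.txt");

        using (var writer = new StreamWriter(path))
        {
            writer.WriteLine("buffered line");
            // At this point the text may still sit in the writer's
            // in-memory buffer; nothing has forced it to disk yet.
            writer.Flush(); // pushes the buffer to the underlying stream
        }
        // Dispose() (via using) also flushes and closes the stream,
        // which is why forgetting to flush often goes unnoticed.

        Console.WriteLine(File.ReadAllText(path));
    }
}
```

Setting writer.AutoFlush = true forces a flush after every write, which trades the buffering speedup for durability.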
If you used a StringBuilder in your first snippet, you would see the execution time drop significantly. Then you would see the true difference in performance, and I am sure the StringBuilder version would come out faster.
Upvotes: 2