Reputation: 49
I'm working with huge string
data for a project in C#. I'm confused about which approach should I use to manipulate my string
data.
First Approach:
StringBuilder myString = new StringBuilder().Append(' ', 1024);
while(someString[++counter] != someChar)
myString[i++] += someString[counter];
Second Approach:
String myString = new String();
int i = counter;
while(soumeString[++counter] != someChar);
myString = someString.SubString(i, counter - i);
Which one of the two would be more fast(and efficient)? Considering the strings I'm working with are huge.
The strings are already in the RAM
.
The size of the string can vary from 32MB-1GB.
Upvotes: 2
Views: 1513
Reputation: 16792
Per request from OP, here are my test results.
Assumptions:
The "checking" process is simple enough that something like Regex is not needed. For now simplifying to a single char comparison. The below code can easily be modified to consider multiple chars at once, this should have no effect on the relative performance of the two approaches.
public static void Main()
{
string bigStr = GenString(100 * 1024 * 1024);
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
int counter = -1;
StringBuilder sb = new StringBuilder();
while (bigStr[++counter] != 'x')
sb.Append(bigStr[counter]);
Console.WriteLine(sb.ToString().Length);
}
sw.Stop();
Console.WriteLine("StringBuilder: {0}", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
int counter = -1;
while (bigStr[++counter] != 'x') ;
Console.WriteLine(bigStr.Substring(0, counter).Length);
}
sw.Stop();
Console.WriteLine("Substring: {0}", sw.Elapsed.TotalSeconds);
}
public static string GenString(int size)
{
StringBuilder sb = new StringBuilder(size);
for (int i = 0; i < size - 1; i++)
{
sb.Append('a');
}
sb.Append('x');
return sb.ToString();
}
Results (release build, .NET 4):
StringBuilder ~7.9 sec
Substring ~1.9 sec
StringBuilder was consistently > 3x slower, with a variety of different sized strings.
Upvotes: 2
Reputation: 54377
For "huge" strings, it may make sense to take a streamed approach and not load the whole thing into memory. For the best raw performance, you can sometimes squeeze a little more speed out by using pointer math to search and capture pieces of strings.
To be clear, I'm stating two completely different approaches.
1 - Stream
The OP doesn't say how big these strings are, but it may be impractical to load them into memory. Perhaps they are being read from a file, from a data reader connected to a DB, from an active network connection, etc.
In this scenario, I would open a stream, read forward, buffering my input in a StringBuilder
until the criteria was met.
2 - Unsafe Char Manipulation
This requires that you do have the complete string. You can obtain a char* to the start of a string quite simply:
// fix entire string in memory so that we can work w/ memory range safely
fixed( char* pStart = bigString )
{
char* pChar = pStart; // unfixed pointer to start of string
char* pEnd = pStart + bigString.Length;
}
You can now increment pChar
and examine each character. You can buffer it (e.g. if you want to examine multiple adjacent characters) or not as you choose. Once you determine the ending memory location, you now have a range of data that you can work with.
Unsafe Code and Pointers in c#
2.1 - A Safer Approach
If you are familiar with unsafe code, it is very fast, expressive, and flexible. If not, I would still use a similar approach, but without the pointer math. This is similar to the approach which @supercat suggested, namely:
StringBuilder
is good for this; set an initial size and reuse the instance.And an obligatory disclaimer for unsafe code: The vast majority of the time the framework methods are a better solution. They are safe, tested, and invoked millions of times per second. Unsafe code puts all of the responsibility on the developer. It does not make any assumptions; it's up to you to be a good framework/OS citizen (e.g. not overwriting immutable strings, allowing buffer overruns, etc.). Because it does not make any assumptions and removes the safeguards, it will often yield a performance increase. It's up to the developer to determine if there is indeed a benefit, and to decide if the advantages are significant enough.
Upvotes: 4
Reputation: 81179
There's an IndexOf
operation which would search more quickly for someChar
, but I'll assume your real function to find the desired length is more complicated than that. In that scenario, I would recommend copying someString
to a Char[]
, doing the search, and then using the new String(Char[], Int32, Int32)
constructor to produce the final string. Indexing a Char[]
is going to be so much more efficient than indexing an String
or StringBuilder
that unless you expect that you'll typically be needing only a small fraction of the string, copying everything to the Char[]
will be a 'win' (unless, of course, you could simply use something like IndexOf
).
Even if the length of the string will often be much larger than the length of interest, you may still be best off using a Char[]
. Pre-initialize the Char[]
to some size, and then do something like:
Char[] temp = new Char[1024]; int i=0; while (i < theString.Length) { int subLength = theString.Length - i; if (subLength > temp.Length) // May impose other constraints on subLength, provided subLength = temp.Length; // it's greater than zero. theString.CopyTo(i, temp, 0, subLength); ... do stuff with the array i+=subLength; }
Once you're all done, you may then use a single SubString call to construct a string with the necessary characters from the original. If your application requires buinding a string whose characters differ from the original, you could use a StringBuilder
and, within the above loop, use the Append(Char[], Int32, Int32)
method to add processed characters to it.
Note also that when the above loop construct, one may decide to reduce subLength
at any point in the loop provided it is not reduced to zero. For example, if one is trying to find whether the string contains a prime number of sixteen or fewer digits enclosed by parentheses, one could start by scanning for an open-paren; if one finds it and it's possible that the data one is looking for might extend beyond the array, set subLength
to the position of the open-paren, and reloop. Such an approach will result in a small amount of redundant copying, but not much (often none), and will eliminate the need to keep track of parsing state between loops. A very convenient pattern.
Upvotes: 1
Reputation: 93030
You always want to use StringBuilder when manipulating strings. This is becwuse strings are immutable, so every time a new object needs to be created.
Upvotes: -1
Reputation: 726619
You should use IndexOf
rather than doing individual character manipulations in a loop, and add whole chunks of string to the result:
StringBuilder myString = new StringBuilder();
int pos = someString.IndexOf(someChar, counter);
myString.Append(someString.SubString(counter, pos));
Upvotes: 4