Reputation: 49

Better option for String Manipulation - .NET

I'm working with huge string data for a project in C#. I'm unsure which approach I should use to manipulate my string data.

First Approach:

StringBuilder myString = new StringBuilder().Append(' ', 1024);
int i = 0;

while (someString[++counter] != someChar)
    myString[i++] = someString[counter];

Second Approach:

string myString;

int i = counter;
while (someString[++counter] != someChar) ;
myString = someString.Substring(i, counter - i);

Which of the two would be faster (and more efficient), considering the strings I'm working with are huge?

The strings are already in the RAM. The size of the string can vary from 32MB-1GB.

Upvotes: 2

Answers (5)

latkin

Reputation: 16792

Per request from OP, here are my test results.

Assumptions:

Big string is already in memory, no requirement for reading from disk
Goal is to not use any native pointers/unsafe blocks

The "checking" process is simple enough that something like Regex is not needed. For now I'm simplifying it to a single char comparison. The code below can easily be modified to consider multiple chars at once; this should have no effect on the relative performance of the two approaches.

using System;
using System.Diagnostics;
using System.Text;

public static void Main()
{
    string bigStr = GenString(100 * 1024 * 1024);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10; i++)
    {
        int counter = -1;
        StringBuilder sb = new StringBuilder();
        while (bigStr[++counter] != 'x')
            sb.Append(bigStr[counter]);
        Console.WriteLine(sb.ToString().Length);
    }
    sw.Stop();
    Console.WriteLine("StringBuilder: {0}", sw.Elapsed.TotalSeconds);

    sw = Stopwatch.StartNew();
    for (int i = 0; i < 10; i++)
    {
        int counter = -1;
        while (bigStr[++counter] != 'x') ;

        Console.WriteLine(bigStr.Substring(0, counter).Length);
    }
    sw.Stop();
    Console.WriteLine("Substring: {0}", sw.Elapsed.TotalSeconds);
}

public static string GenString(int size)
{
    StringBuilder sb = new StringBuilder(size);
    for (int i = 0; i < size - 1; i++)
    {
        sb.Append('a');
    }
    sb.Append('x');
    return sb.ToString();            
}

Results (release build, .NET 4):

StringBuilder ~7.9 sec

Substring ~1.9 sec

StringBuilder was consistently > 3x slower, with a variety of different sized strings.

Upvotes: 2

Tim M.

Reputation: 54377

For "huge" strings, it may make sense to take a streamed approach and not load the whole thing into memory. For the best raw performance, you can sometimes squeeze a little more speed out by using pointer math to search and capture pieces of strings.

To be clear, I'm stating two completely different approaches.

1 - Stream
The OP doesn't say how big these strings are, but it may be impractical to load them into memory. Perhaps they are being read from a file, from a data reader connected to a DB, from an active network connection, etc.

In this scenario, I would open a stream and read forward, buffering my input in a StringBuilder until the criterion is met.
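A minimal sketch of that streamed read, assuming the stop condition is a single delimiter character. The class and method names (StreamScanSketch, ReadUntil) are hypothetical, and a StringReader stands in for whatever file, database, or network source the text really comes from:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamScanSketch
{
    // Reads forward one character at a time and buffers the input until
    // the delimiter (or the end of the input) is reached.
    public static string ReadUntil(TextReader reader, char delimiter)
    {
        var buffer = new StringBuilder();
        int next;
        // TextReader.Read() returns -1 when no more characters are available.
        while ((next = reader.Read()) != -1 && (char)next != delimiter)
        {
            buffer.Append((char)next);
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        // Assumption: a StringReader replaces the real stream for illustration.
        using (TextReader reader = new StringReader("aaaaxbbb"))
        {
            Console.WriteLine(ReadUntil(reader, 'x')); // prints "aaaa"
        }
    }
}
```

Only the characters before the delimiter are ever held in the buffer, so memory use depends on the match length rather than on the total size of the input.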

2 - Unsafe Char Manipulation
This requires that you do have the complete string. You can obtain a char* to the start of a string quite simply:

// fix entire string in memory so that we can work w/ memory range safely
fixed( char* pStart = bigString ) 
{
    char* pChar = pStart; // unfixed pointer to start of string
    char* pEnd = pStart + bigString.Length;
}

You can now increment pChar and examine each character. You can buffer it (e.g. if you want to examine multiple adjacent characters) or not as you choose. Once you determine the ending memory location, you now have a range of data that you can work with.
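As a rough illustration of that scan, here is a sketch that walks the pointer forward until a stop character is found and then builds a string from the range. The names UnsafeScanSketch and TakeUntil are hypothetical, the single-character stop condition is an assumption, and the code has to be compiled with unsafe blocks enabled:

```csharp
using System;

public static class UnsafeScanSketch
{
    // Returns the text before the first occurrence of stop,
    // or the whole string if stop never occurs.
    public static unsafe string TakeUntil(string bigString, char stop)
    {
        // Fix the string in memory so the pointer range stays valid.
        fixed (char* pStart = bigString)
        {
            char* pChar = pStart;
            char* pEnd = pStart + bigString.Length;

            // Advance until the stop character or the end of the string.
            while (pChar < pEnd && *pChar != stop)
            {
                pChar++;
            }

            // Copy the range [pStart, pChar) into a new string.
            return new string(pStart, 0, (int)(pChar - pStart));
        }
    }

    public static void Main()
    {
        Console.WriteLine(TakeUntil("aaaaxbbb", 'x')); // prints "aaaa"
    }
}
```

The bounds check against pEnd is the part the framework would otherwise do for you; without it the loop would read past the end of the string when no match exists.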

Unsafe Code and Pointers in C#

2.1 - A Safer Approach

If you are familiar with unsafe code, it is very fast, expressive, and flexible. If not, I would still use a similar approach, but without the pointer math. This is similar to the approach which @supercat suggested, namely:

1. Get a char[].
2. Read through it character by character.
3. Buffer where needed. StringBuilder is good for this; set an initial size and reuse the instance.
4. Analyze buffer where needed.
5. Dump buffer often.
6. Do something with the buffer when it contains the desired match.
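The steps above could be sketched roughly as follows. The "desired match" is assumed here to be any run of characters that ends at a delimiter, and BufferedScanSketch and SplitOn are hypothetical names chosen for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BufferedScanSketch
{
    // Collects every run of characters that is terminated by delimiter.
    // Text after the last delimiter is discarded in this sketch.
    public static List<string> SplitOn(string source, char delimiter)
    {
        char[] chars = source.ToCharArray();   // get a char[] (this copies the data)
        var buffer = new StringBuilder(256);   // sized once, reused for every match
        var matches = new List<string>();

        foreach (char c in chars)              // read character by character
        {
            if (c == delimiter)
            {
                matches.Add(buffer.ToString()); // use the buffer on a match
                buffer.Clear();                 // dump the buffer, keep the instance
            }
            else
            {
                buffer.Append(c);               // buffer where needed
            }
        }
        return matches;
    }

    public static void Main()
    {
        foreach (string part in SplitOn("ab;cd;ef", ';'))
        {
            Console.WriteLine(part); // prints "ab" then "cd"
        }
    }
}
```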

And an obligatory disclaimer for unsafe code: The vast majority of the time the framework methods are a better solution. They are safe, tested, and invoked millions of times per second. Unsafe code puts all of the responsibility on the developer. It does not make any assumptions; it's up to you to be a good framework/OS citizen (e.g. not overwriting immutable strings, not allowing buffer overruns, etc.). Because it does not make any assumptions and removes the safeguards, it will often yield a performance increase. It's up to the developer to determine if there is indeed a benefit, and to decide if the advantages are significant enough.

Upvotes: 4

supercat

Reputation: 81179

There's an IndexOf operation which would search more quickly for someChar, but I'll assume your real function to find the desired length is more complicated than that. In that scenario, I would recommend copying someString to a Char[], doing the search, and then using the new String(Char[], Int32, Int32) constructor to produce the final string. Indexing a Char[] is going to be so much more efficient than indexing a String or StringBuilder that, unless you expect to typically need only a small fraction of the string, copying everything to the Char[] will be a 'win' (unless, of course, you could simply use something like IndexOf).
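A small sketch of that copy, search, and construct sequence. The check used here (char.IsLetter) is only a stand-in for whatever the real, more complicated condition is, and CharArraySketch and TakeWhileLetter are hypothetical names:

```csharp
using System;

public static class CharArraySketch
{
    // Returns the run of letters that begins at start.
    // Assumes 0 <= start <= someString.Length.
    public static string TakeWhileLetter(string someString, int start)
    {
        char[] chars = someString.ToCharArray(); // one up-front copy of the string
        int end = start;

        // Search by indexing the array rather than the string.
        while (end < chars.Length && char.IsLetter(chars[end]))
        {
            end++;
        }

        // String(Char[], Int32, Int32): source array, start index, length.
        return new string(chars, start, end - start);
    }

    public static void Main()
    {
        Console.WriteLine(TakeWhileLetter("hello, world", 0)); // prints "hello"
        Console.WriteLine(TakeWhileLetter("hello, world", 7)); // prints "world"
    }
}
```

In real use the array would be created once and reused across many searches; copying it inside the method, as done here for brevity, would pay the copy cost on every call.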

Even if the length of the string will often be much larger than the length of interest, you may still be best off using a Char[]. Pre-initialize the Char[] to some size, and then do something like:

Char[] temp = new Char[1024];
int i=0;
while (i < theString.Length)
{
  int subLength = theString.Length - i;
  if (subLength > temp.Length)  // May impose other constraints on subLength, provided
    subLength = temp.Length;    // it's greater than zero.
  theString.CopyTo(i, temp, 0, subLength);
  // ... do stuff with the array
  i+=subLength;
}

Once you're all done, you may then use a single Substring call to construct a string with the necessary characters from the original. If your application requires building a string whose characters differ from the original, you could use a StringBuilder and, within the above loop, use the Append(Char[], Int32, Int32) method to add processed characters to it.
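To make the chunked loop concrete, here is one possible filling-in of the "do stuff" step, followed by the single Substring call. The search for the first digit is an assumed placeholder for the real processing, and ChunkedScanSketch and IndexOfFirstDigit are hypothetical names:

```csharp
using System;

public static class ChunkedScanSketch
{
    // Scans theString in fixed-size chunks and returns the index of the
    // first digit, or -1 if there is none.
    public static int IndexOfFirstDigit(string theString, int chunkSize)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException("chunkSize");

        char[] temp = new char[chunkSize];
        int i = 0;
        while (i < theString.Length)
        {
            int subLength = Math.Min(theString.Length - i, temp.Length);
            theString.CopyTo(i, temp, 0, subLength);

            // "Do stuff with the array": look for a digit in this chunk.
            for (int j = 0; j < subLength; j++)
            {
                if (char.IsDigit(temp[j]))
                    return i + j; // translate back to an index in the string
            }
            i += subLength;
        }
        return -1;
    }

    public static void Main()
    {
        string s = "abcdefg7xyz";
        int end = IndexOfFirstDigit(s, 4);
        // One Substring call at the end builds the result from the original.
        Console.WriteLine(end < 0 ? s : s.Substring(0, end)); // prints "abcdefg"
    }
}
```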

Note also that when using the above loop construct, one may decide to reduce subLength at any point in the loop, provided it is not reduced to zero. For example, if one is trying to find whether the string contains a prime number of sixteen or fewer digits enclosed by parentheses, one could start by scanning for an open-paren; if one finds it and it's possible that the data one is looking for might extend beyond the array, set subLength to the position of the open-paren, and reloop. Such an approach will result in a small amount of redundant copying, but not much (often none), and will eliminate the need to keep track of parsing state between loops. A very convenient pattern.

Upvotes: 1

Petar Ivanov

Reputation: 93030

You always want to use StringBuilder when manipulating strings. This is because strings are immutable, so every modification requires a new object to be created.

Upvotes: -1

Sergey Kalinichenko

Reputation: 726619

You should use IndexOf rather than doing individual character manipulations in a loop, and add whole chunks of string to the result:

StringBuilder myString = new StringBuilder();
int pos = someString.IndexOf(someChar, counter);
myString.Append(someString.Substring(counter, pos - counter));
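A self-contained sketch of the same idea. Note that Substring takes a start index and a length (not an end index), and that IndexOf returns -1 when the character is not found; returning the rest of the string in that case is an assumption made here, and IndexOfSketch and TakeUntil are hypothetical names:

```csharp
using System;

public static class IndexOfSketch
{
    // Returns the text from start up to (not including) the next someChar.
    public static string TakeUntil(string someString, int start, char someChar)
    {
        int pos = someString.IndexOf(someChar, start);
        if (pos < 0)
        {
            // Assumption: with no match, take everything from start onward.
            return someString.Substring(start);
        }
        // Second argument is a length, hence pos - start.
        return someString.Substring(start, pos - start);
    }

    public static void Main()
    {
        Console.WriteLine(TakeUntil("key=value;rest", 4, ';')); // prints "value"
    }
}
```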

Upvotes: 4
