urlreader

Reputation: 6605

How to get the # of occurrences of a char in a string FAST in C#?

I have a txt file. Right now, I load it line by line and check how many times '@' appears in the entire file.

So, basically, given a single-line string, how do I get the # of occurrences of '@' fast?

I need to count this fast since we have lots of files like this, and each of them is about 300-400MB.

I searched, and it seems the straightforward way is also the fastest:

int num = 0;
foreach (char c in line)
{
    if (c == '@') num++;
}

Is there a different method that could be faster than this? Any other suggestions?

Thanks

Upvotes: 0

Views: 148

Answers (3)

Federico Dipuma

Reputation: 18265

The fastest approach really depends on I/O capabilities and computational speed. Usually the best way to understand which technique is fastest is to benchmark them.

Disclaimer: results are (of course) bound to my machine and may vary significantly on different hardware. For testing I used a single text file of about 400MB in size. If interested, the file may be downloaded here (zipped). The executable was compiled as x86.
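
For reference, a minimal Stopwatch-based harness along these lines is enough to reproduce this kind of comparison (a sketch; the Benchmark name, run count and wrapped delegate are placeholders, not the exact code behind the numbers below):

using System;
using System.Diagnostics;

static void Benchmark(string name, Action action, int runs = 5)
{
    action(); // warm-up run so JIT compilation does not skew the timings

    long totalMs = 0;
    for (var i = 0; i < runs; i++)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        totalMs += sw.ElapsedMilliseconds;
    }

    Console.WriteLine($"{name}: {totalMs / runs} ms average over {runs} runs");
}

// usage, with each option wrapped in a delegate (CountOption1 is a hypothetical method):
// Benchmark("Option 1", () => CountOption1("C:\\tmp\\test.txt"));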

Option 1: Read entire file, no parallelization

long count = 0;

var text = File.ReadAllText("C:\\tmp\\test.txt");
for (var i = 0; i < text.Length; i++)
    if (text[i] == '@')
        count++;

Results:

  • Average execution time: 5828 ms
  • Average process memory: 1674 MB

This is the "naive" approach, which reads the entire file into memory and then uses a for loop (which is significantly faster than foreach or LINQ).

As expected, the memory occupied by the process is very high (about 4 times the file size); this may be caused by a combination of the string's size in memory (more info here) and string-processing overhead.
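
As a rough back-of-the-envelope check of that factor (my own estimate, not part of the measurement): .NET strings are UTF-16, so a mostly-ASCII file roughly doubles in size once loaded as a string, and the decoding buffers used by ReadAllText add to the peak on top of that:

// rough estimate only; assumes the file is mostly ASCII (1 byte per char on disk)
var fileBytes = new FileInfo("C:\\tmp\\test.txt").Length;
var stringBytes = fileBytes * sizeof(char); // UTF-16: 2 bytes per char in memory

Console.WriteLine($"file on disk: ~{fileBytes / (1024 * 1024)} MB");
Console.WriteLine($"string alone: ~{stringBytes / (1024 * 1024)} MB"); // ~800 MB for a 400 MB file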

Option 2: Read file in chunks, no parallelization

long count = 0;
using (var file = File.OpenRead("C:\\tmp\\test.txt"))
using (var reader = new StreamReader(file))
{
    const int size = 500000; // chunk size: 500k chars
    char[] buffer = new char[size];

    while (!reader.EndOfStream)
    {
        var read = await reader.ReadBlockAsync(buffer, 0, size); // read chunk

        for (var i = 0; i < read; i++)
            if (buffer[i] == '@')
                count++;
    }
}

Results:

  • Average execution time: 4819 ms
  • Average process memory: 7.48 MB

This was unexpected. In this version we are reading the file in chunks of 500k characters instead of loading it entirely into memory, and the execution time is even lower than in the previous approach. Please note that reducing the chunk size will increase execution time (because of the overhead). Memory consumption is extremely low (as expected: we are only holding roughly 500k chars, about 1 MB, in a char array at any time).

Better (or worse) performance may be obtained by changing the chunk size.
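
Note that the snippet above uses await, so it has to live inside an async method; a self-contained way to package it (a sketch with a hypothetical method name, otherwise the same logic) could be:

using System.IO;
using System.Threading.Tasks;

static async Task<long> CountCharInChunksAsync(string path, char target, int chunkSize = 500000)
{
    long count = 0;
    using (var file = File.OpenRead(path))
    using (var reader = new StreamReader(file))
    {
        var buffer = new char[chunkSize];
        while (!reader.EndOfStream)
        {
            var read = await reader.ReadBlockAsync(buffer, 0, chunkSize); // read one chunk
            for (var i = 0; i < read; i++)
                if (buffer[i] == target)
                    count++;
        }
    }
    return count;
}

// usage: long count = await CountCharInChunksAsync("C:\\tmp\\test.txt", '@');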

Option 3: Read file in chunks, with parallelization

long count = 0;
using (var file = File.OpenRead("C:\\tmp\\test.txt"))
using (var reader = new StreamReader(file))
{
    const int size = 2000000; // roughly 4 times the single-threaded value
    const int parallelization = 4; // each chunk is split into this many sub-chunks processed in parallel
    char[] buffer = new char[size];

    while (!reader.EndOfStream)
    {
        var read = await reader.ReadBlockAsync(buffer, 0, size);

        var sliceSize = read / parallelization;
        var counts = new long[parallelization];

        Parallel.For(0, parallelization, i => {
            var start = i * sliceSize;
            var end = start + sliceSize;

            if (i == parallelization - 1) // the last slice also takes the remainder
                end += read % parallelization;

            long localCount = 0;
            for (var j = start; j < end; j++)
            {
                if (buffer[j] == '@')
                    localCount++;
            }
            counts[i] = localCount;
        });

        count += counts.Sum();
    }
}

Results:

  • Average execution time: 3363 ms
  • Average process memory: 10.37 MB

As expected, this version performs better than the single-threaded one, but not 4 times better as one might have thought. Memory consumption is again very low compared to the first version (same considerations as before), and we are taking advantage of multi-core environments.

Parameters like chunk size and the number of parallel tasks may significantly change the results; you should just go by trial and error to find the best combination for you.
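
If you prefer not to manage the counts array and the remainder slice by hand, Parallel.For also has an overload with per-task local state; a rough equivalent of the loop above (a sketch with a hypothetical method name, tune the constants as before) would be:

using System.IO;
using System.Threading;
using System.Threading.Tasks;

static async Task<long> CountCharParallelAsync(string path, char target,
    int chunkSize = 2000000, int parallelism = 4)
{
    long count = 0;
    using (var file = File.OpenRead(path))
    using (var reader = new StreamReader(file))
    {
        var buffer = new char[chunkSize];
        while (!reader.EndOfStream)
        {
            var read = await reader.ReadBlockAsync(buffer, 0, chunkSize);
            var sliceSize = read / parallelism;

            long chunkCount = 0;
            Parallel.For(0, parallelism,
                () => 0L,                               // per-task local counter
                (i, state, local) =>
                {
                    var start = i * sliceSize;
                    // the last slice also covers the remainder characters
                    var end = (i == parallelism - 1) ? read : start + sliceSize;
                    for (var j = start; j < end; j++)
                        if (buffer[j] == target)
                            local++;
                    return local;
                },
                local => Interlocked.Add(ref chunkCount, local)); // merge the local counters

            count += chunkCount;
        }
    }
    return count;
}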

Conclusions

I was inclined to think that the "load everything in memory" version would be the fastest, but this really depends on string-processing overhead and I/O speed. The parallel chunked approach seems to be the fastest on my machine, which should lead you to one idea: when in doubt, just benchmark it.

Upvotes: 4

Rufus L

Reputation: 37020

You can test if it's faster, but a shorter way to write it would be:

int num = File.ReadAllText(filePath).Count(i => i == '@');

Hmm, but I just saw you need the line count as well, so this is similar. Again, it would need to be compared to what you have:

var fileLines = File.ReadAllLines(filePath);
var count = fileLines.Length;
var num = fileLines.Sum(line => line.Count(i => i == '@'));
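
For files in the 300-400MB range, ReadAllLines keeps every line in memory at once; if that is a concern, a lazily-streamed variant (a sketch, not benchmarked) would be:

long lineCount = 0, atCount = 0;
foreach (var line in File.ReadLines(filePath)) // streams lines lazily instead of loading them all
{
    lineCount++;
    foreach (var c in line)
        if (c == '@') atCount++;
}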

Upvotes: 1

Icemanind

Reputation: 48686

You could use pointers. I don't know if this would be any faster though. You would have to do some testing:

static void Main(string[] args)
{
    string str = "This is @ my st@ing";
    int numberOfCharacters = 0;

    unsafe
    {
        fixed (char *p = str)
        {
            char *ptr = p;
            while (*ptr != '\0')
            {
                if (*ptr == '@')
                    numberOfCharacters++;
                ptr++;
            }
        }
    }

    Console.WriteLine(numberOfCharacters);
}

Note that you must go into your project properties and allow unsafe code in order for this code to work.
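
One caveat worth noting (my addition, not part of the original claim): the loop stops at the first '\0', so a string containing embedded null characters would be under-counted. A length-bounded variant of the same idea avoids that:

string str = "This is @ my st@ing";
int numberOfCharacters = 0;

unsafe
{
    fixed (char* p = str)
    {
        // bound the walk by the string's Length instead of the terminator,
        // so embedded '\0' characters don't end the scan early
        for (char* ptr = p; ptr < p + str.Length; ptr++)
        {
            if (*ptr == '@')
                numberOfCharacters++;
        }
    }
}

Console.WriteLine(numberOfCharacters);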

Upvotes: -2
