Reputation: 6605
I have a txt file. Right now I load it line by line and check how many times '@' appears in the entire file.
So, given a single-line string, how do I get the number of occurrences of '@' fast?
I need this to be fast because we have lots of files like this, each about 300-400 MB.
I searched, it seems the straightforward way is the fastest way to do this:
int num = 0;
foreach (char c in line)
{
    if (c == '@') num++;
}
Is there a different method that could be faster than this? Any other suggestions?
Thanks
Upvotes: 0
Views: 148
Reputation: 18265
The fastest approach really depends on I/O capabilities and computational speed. Usually the best way to find the fastest technique is to benchmark the candidates.
Disclaimer: Results are (of course) bound to my machine and may vary significantly on different hardware. For testing I have used a single text file of about 400MB in size. If interested the file may be downloaded here (zipped). Executable compiled as x86.
long count = 0;
var text = File.ReadAllText("C:\\tmp\\test.txt");
for (var i = 0; i < text.Length; i++)
    if (text[i] == '@')
        count++;
Results:
5828 ms
1674 MB
This is the "naive" approach, which reads the entire file in memory and then uses a for
loop (which is significantly faster than foreach
or LINQ).
As expected process occupied memory is very high (about 4 times the file size), this may be caused by a combination of string size in memory (more info here) and string processing overhead.
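For reference, here is a minimal sketch (on a small sample string, not the 400 MB file) comparing the LINQ one-liner with the indexed for loop; both produce the same count, but the for loop avoids the delegate-invocation overhead of LINQ:

```csharp
using System;
using System.Linq;

class CountComparison
{
    static void Main()
    {
        var text = "user@example.com and admin@example.org"; // sample input

        // LINQ version: concise, but invokes a delegate per character
        long linqCount = text.Count(c => c == '@');

        // Indexed for loop over the string, as in the benchmark above
        long forCount = 0;
        for (var i = 0; i < text.Length; i++)
            if (text[i] == '@')
                forCount++;

        Console.WriteLine($"{linqCount} {forCount}"); // prints "2 2"
    }
}
```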
long count = 0;
using (var file = File.OpenRead("C:\\tmp\\test.txt"))
using (var reader = new StreamReader(file))
{
    const int size = 500000; // chunk size: 500k chars
    char[] buffer = new char[size];
    while (!reader.EndOfStream)
    {
        // await requires an enclosing async method
        var read = await reader.ReadBlockAsync(buffer, 0, size); // read one chunk
        for (var i = 0; i < read; i++)
            if (buffer[i] == '@')
                count++;
    }
}
Results:
4819 ms
7.48 MB
This was unexpected. This version reads the file in chunks of 500k characters instead of loading it entirely into memory, yet execution time is even lower than in the previous approach. Note that reducing the chunk size increases execution time (because of the per-chunk overhead). Memory consumption is extremely low: as expected, we only hold a 500k-char buffer (about 1 MB, since each char is 2 bytes) in memory.
Better (or worse) performance may be obtained by changing the chunk size.
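To find a good chunk size empirically, a small Stopwatch harness can time the same count at several sizes. This is a sketch: the generated temp file stands in for your real input, and the chunk sizes tried are arbitrary examples:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class ChunkBenchmark
{
    // Counts '@' in the file at 'path' using a fixed-size char buffer,
    // reporting elapsed milliseconds for the given chunk size.
    static long CountChunked(string path, int chunkSize, out long elapsedMs)
    {
        long count = 0;
        var sw = Stopwatch.StartNew();
        using (var reader = new StreamReader(path))
        {
            var buffer = new char[chunkSize];
            int read;
            while ((read = reader.ReadBlock(buffer, 0, chunkSize)) > 0)
                for (var i = 0; i < read; i++)
                    if (buffer[i] == '@')
                        count++;
        }
        sw.Stop();
        elapsedMs = sw.ElapsedMilliseconds;
        return count;
    }

    static void Main()
    {
        // Hypothetical input: replace with your real file, e.g. "C:\\tmp\\test.txt"
        var path = Path.GetTempFileName();
        File.WriteAllText(path, string.Concat(Enumerable.Repeat("abc@def\n", 100000)));

        foreach (var size in new[] { 5_000, 50_000, 500_000 })
        {
            var count = CountChunked(path, size, out var ms);
            Console.WriteLine($"chunk={size}: {count} matches, {ms} ms");
        }
        File.Delete(path);
    }
}
```

Each chunk size should report the same count (100000 here); only the timing changes.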
long count = 0;
using (var file = File.OpenRead("C:\\tmp\\test.txt"))
using (var reader = new StreamReader(file))
{
    const int size = 2000000; // roughly 4 times the single-threaded chunk size
    const int parallelization = 4; // each chunk is split into sub-chunks processed in parallel
    char[] buffer = new char[size];
    while (!reader.EndOfStream)
    {
        // await requires an enclosing async method
        var read = await reader.ReadBlockAsync(buffer, 0, size);
        var sliceSize = read / parallelization;
        var counts = new long[parallelization];
        Parallel.For(0, parallelization, i => {
            var start = i * sliceSize;
            var end = start + sliceSize;
            if (i == parallelization - 1) // last slice also takes the remainder
                end += read % parallelization;
            long localCount = 0;
            for (var j = start; j < end; j++)
            {
                if (buffer[j] == '@')
                    localCount++;
            }
            counts[i] = localCount;
        });
        count += counts.Sum();
    }
}
Results:
3363 ms
10.37 MB
As expected, this version performs better than the single-threaded one, though not 4 times better as one might hope. Memory consumption is again very low compared to the first version (same considerations as before), and we take advantage of multi-core hardware.
Parameters like chunk size and the number of parallel tasks can change the results significantly; you will have to experiment to find the best combination for your workload.
I was inclined to think that the "load everything in memory" version would be the fastest, but that really depends on string-processing overhead and I/O speed. The parallel chunked approach is the fastest on my machine, which leads to the key takeaway: when in doubt, benchmark it.
Upvotes: 4
Reputation: 37020
You can test if it's faster, but a shorter way to write it would be:
int num = File.ReadAllText(filePath).Count(i => i == '@');
Hmm, but I just saw you need the line count as well, so this is similar. Again, it would need to be benchmarked against what you have (note that Length is a property on arrays, not a method):
var fileLines = File.ReadAllLines(filePath);
var count = fileLines.Length;
var num = fileLines.Sum(line => line.Count(i => i == '@'));
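If memory is a concern with 300-400 MB files, `File.ReadLines` streams the file lazily instead of materializing every line the way `ReadAllLines` does. A minimal sketch, using a temp file as a stand-in for the real input:

```csharp
using System;
using System.IO;
using System.Linq;

class StreamingCount
{
    static void Main()
    {
        // Hypothetical input file; replace with your real path
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "a@b", "no ats", "x@y@z" });

        long lineCount = 0;
        long atCount = 0;
        // File.ReadLines enumerates lazily, so only one line is held in memory at a time
        foreach (var line in File.ReadLines(path))
        {
            lineCount++;
            atCount += line.Count(c => c == '@');
        }

        Console.WriteLine($"{lineCount} lines, {atCount} '@'"); // prints "3 lines, 3 '@'"
        File.Delete(path);
    }
}
```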
Upvotes: 1
Reputation: 48686
You could use pointers. I don't know if this would be any faster though. You would have to do some testing:
static void Main(string[] args)
{
    string str = "This is @ my st@ing";
    int numberOfCharacters = 0;
    unsafe
    {
        fixed (char* p = str)
        {
            char* ptr = p;
            // .NET strings are null-terminated in memory, so this stops at '\0'
            while (*ptr != '\0')
            {
                if (*ptr == '@')
                    numberOfCharacters++;
                ptr++;
            }
        }
    }
    Console.WriteLine(numberOfCharacters);
}
Note that you must go into your project properties and allow unsafe code in order for this code to compile. Also beware that a string containing an embedded '\0' character would end the loop early; iterating up to str.Length instead avoids that edge case.
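For SDK-style projects there is no "Allow unsafe code" checkbox; the equivalent (assuming a standard .csproj) is a property in the project file:

```xml
<!-- .csproj: enables the unsafe/fixed blocks used above -->
<PropertyGroup>
  <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
</PropertyGroup>
```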
Upvotes: -2