I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok() . #include <stdio.h> #include <string.h> int main() { int i; char ch, buf[256]; FILE *fp = fopen("test.csv", "r"); for (;;) { for (i = 0; ; i++) { ch = fgetc(fp); if (ch == ',') { buf[i] = '\0'; puts(buf); break; } else if (ch == '\n') { buf[i] = '\0'; puts(buf); fclose(fp); return 0; } else buf[i] = ch; } } } Is this method as efficient and correct as possible? What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF , feof() , ferror() , etc.). Can I perform the same task using C++ file I/O with no loss of efficiency ?

What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc ), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient". That having been said, there are a few things you could try: Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile , and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism. Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it. Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case. Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf . When in doubt, though, try a few possibilities and profile, profile, profile. Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance). Regarding C++ file IO ( fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes. If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.) It all depends on whether you're going to want or need that additional functionality or safety in the future. Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro. But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)

Proper, efficient file reading

Goz

Reputation: 62333

One method, provided you are going to scan through the file serially, is to use 2 buffers of a decent enough size (16K is the optimal size for SSDs and 4K for HDDs IIRC. But 16K should suffice for both). You start off by performing an asynchronous load (In windows look up Overlapped I/O and on Unix/OSX use O_NONBLOCK) of the first 16K into buffer 0 and then start another load into buffer 1 of bytes 16K to 32K. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead) wait for any further loads to complete into buffer 1 and perform an asynchronous load of bytes 32K to 48K into buffer 0 and so on. This way, you have far less chance of ever having to wait for a load to complete as it should be happening while you are processing the previous 16K.

I moved over to a scheme like this in my XML parser having been using fopen and fgetc previously and the speedup was huge. Loading in a 15 meg XML file and processing it reduced from minutes to seconds. Of course, Your milage may vary.

Upvotes: 3

leander

Reputation: 8737

What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".

That having been said, there are a few things you could try:

Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.

When in doubt, though, try a few possibilities and profile, profile, profile.

Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).

Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.

If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)

It all depends on whether you're going to want or need that additional functionality or safety in the future.

Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.

But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)

Upvotes: 5

Proper, efficient file reading

Answers (3)

Related Questions