Reputation: 344
I have written a C/C++-program for Windows 7 - 64bit that works on very large files. In the final step it reads lines from an input-file (10GB+) and writes them to an output file. The access to the input-file is random, the writing is sequential. EDIT: Main reason for this approach is to reduce RAM usage.
What I basically do in the reading part is this: (Sorry, very shortened and maybe buggy)
void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
fseeko64(filePointer, pos, ios_base::beg);
fgets(line, lineSize, filePointer);
}
Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic, since the performance drops occure on different machines at other parts of the file or even don't occure at all. It even goes so far, that the program totally stops reading, while there are no disc-operations.
Another sympthom seems to be the used RAM. My process keeps it's RAM steady, but the RAM used by the System grows sometimes very large. After using some RAM-Tools I found out, that the Windows Mapped File grows into several GBs. This behaviour also seems to depend on the hardware, since it occure on different machines at different parts of the process.
As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the responsetime of the HDD.
My guess is that the Windows Caching gets somehow "wierd". The program is fast as long as the cache does it's work. But when Caching goes wrong, the behaviour goes either into "stop reading" or "grow cache size" and sometimes even both. Since I'm no expert for the windows caching algorithms, I would be happy to hear an explanation. Also, is there any way to get Windows out of C/C++ to manipulate/stop/enforce the caching.
Since I'm hunting this problem for a while now, I've already tried some tricks, that didn't work out:
Thanks in advance
Upvotes: 1
Views: 1431
Reputation: 62459
Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:
It's definitely a more complex approach, but it would be much kinder to your caching subsystem, and as a result, could potentially perform significantly better.
Upvotes: 3
Reputation: 48022
Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.
There are multiple levels of caching:
If you replace your I/O calls via the f- functions in the CRT with the comparable ones in the Windows API (e.g., CreateFile, ReadFile, etc.) you can eliminate the CRT caching, which may be doing more harm than good. You can also warn the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS
and possibly FILE_FLAG_NO_BUFFERING
.
You'll need to experiment and measure.
You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
Upvotes: 3