Aceonline

Reputation: 344

Random reading on very large files with fgets seems to push Windows caching to its limits

I have written a C/C++ program for Windows 7 64-bit that works on very large files. In the final step it reads lines from an input file (10 GB+) and writes them to an output file. Access to the input file is random; the writing is sequential. EDIT: The main reason for this approach is to reduce RAM usage.

What I basically do in the reading part is this (sorry, very shortened and maybe buggy):

void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
  fseeko64(filePointer, pos, SEEK_SET);      // SEEK_SET is the stdio constant; ios_base::beg belongs to iostreams
  fgets(line, (int)lineSize, filePointer);   // fgets takes an int buffer size
}
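For context, a call site might look roughly like this; the file name, buffer size, and offset are placeholders, and the file is opened in binary mode so byte offsets recorded earlier stay valid:

#include <cstdio>

void readOneLine(){
  FILE* in = fopen("input.txt", "rb");       // "rb" so fseeko64 positions are plain byte offsets
  if (!in) return;
  char line[4096];
  off64_t somePos = 123456789;               // illustrative byte offset recorded by an earlier indexing pass
  seekAndGetLine(line, sizeof(line), somePos, in);
  fclose(in);
}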

Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic: the performance drops occur on different machines, at different parts of the file, or sometimes not at all. It even goes so far that the program stops reading entirely, while no disk operations are taking place.

Another symptom seems to be the RAM usage. My process keeps its RAM steady, but the RAM used by the system sometimes grows very large. After using some RAM tools I found out that the Windows "Mapped File" memory grows to several GB. This behaviour also seems to depend on the hardware, since it occurs on different machines at different points in the process.

As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the response time of the HDD.

My guess is that the Windows caching somehow goes "weird". The program is fast as long as the cache does its work, but when caching goes wrong, the behaviour turns into "stop reading" or "grow cache size", and sometimes even both. Since I'm no expert on the Windows caching algorithms, I would be happy to hear an explanation. Also, is there any way to manipulate/stop/enforce the caching from C/C++ on Windows?

Since I've been hunting this problem for a while now, I've already tried some tricks that didn't work out.

Thanks in advance

Upvotes: 1

Views: 1431

Answers (2)

twalberg

Reputation: 62459

Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:

  1. decide which lines you need from the input file and store the line numbers in a list
  2. sort the list of line numbers
  3. read through the input file once, in order, and pull out the lines you need (better yet, seek to the next needed line and grab it, especially when there are big gaps)
  4. if the list of lines you're grabbing is small enough, you can store them in memory for reordering before output; otherwise, stick them in a smaller temporary file and use that file as input for your current algorithm to reorder the lines for final output

It's definitely a more complex approach, but it would be much kinder to your caching subsystem and, as a result, could perform significantly better.
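A minimal sketch of steps 1–3, assuming the wanted line numbers are already known, unique, and 0-based, and that each line fits into a fixed buffer (collectLines and the container names are made up for illustration):

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// One sequential pass that collects the wanted lines in file order.
// Reordering them back to the original request order is step 4.
std::vector<std::string> collectLines(const char* path, std::vector<long long> wanted){
  std::sort(wanted.begin(), wanted.end());                            // step 2: sort the line numbers
  std::vector<std::string> found;
  FILE* in = std::fopen(path, "rb");
  if (!in) return found;
  char buf[4096];
  long long lineNo = 0;
  size_t next = 0;
  while (next < wanted.size() && std::fgets(buf, sizeof(buf), in)){   // step 3: read in order
    if (lineNo == wanted[next]){
      found.push_back(buf);
      ++next;
    }
    ++lineNo;
  }
  std::fclose(in);
  return found;
}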

Upvotes: 3

Adrian McCarthy

Reputation: 48022

Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.

There are multiple levels of caching:

  • the CRT library (since you're using the f- functions)
  • the OS and filesystem
  • probably onboard the drive itself

If you replace your calls to the CRT's f- functions with the comparable Windows API calls (e.g., CreateFile, ReadFile, etc.), you can eliminate the CRT caching, which may be doing more harm than good. You can also warn the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS and possibly FILE_FLAG_NO_BUFFERING.
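A rough sketch of what that could look like, with error handling trimmed and a placeholder path (FILE_FLAG_NO_BUFFERING is stricter, since it requires sector-aligned offsets and buffer sizes, so this example only passes the random-access hint):

#include <windows.h>

void readChunkAt(long long offset){
  HANDLE h = CreateFileA("D:\\data\\huge.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                         OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, NULL);
  if (h == INVALID_HANDLE_VALUE) return;
  LARGE_INTEGER pos;
  pos.QuadPart = offset;
  SetFilePointerEx(h, pos, NULL, FILE_BEGIN);        // seek to the requested byte offset
  char buf[4096];
  DWORD bytesRead = 0;
  ReadFile(h, buf, sizeof(buf), &bytesRead, NULL);   // reads raw bytes; you still have to find the line end yourself
  CloseHandle(h);
}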

You'll need to experiment and measure.

You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
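For instance, the batching idea could be as simple as sorting each batch of byte offsets before servicing it, so the reads sweep the file mostly front to back (processBatch is a made-up helper; seekAndGetLine and off64_t are the ones from the question):

#include <algorithm>
#include <cstdio>
#include <vector>

void processBatch(std::vector<off64_t>& offsets, FILE* in, char* line, size_t lineSize){
  std::sort(offsets.begin(), offsets.end());         // service the batch in ascending file order
  for (off64_t pos : offsets){
    seekAndGetLine(line, lineSize, pos, in);         // the question's helper
    // ... remember the original request order so the output can be re-sequenced later ...
  }
}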

Upvotes: 3
