juergen861

Reputation: 29

Extreme performance difference when reading the same files a second time in C++

I have to read binary data into char arrays from large (2 GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after first running a different dummy program that does almost the same thing, the next reads take only about 1.4 seconds per file. The Windows Task Manager also shows much less disk activity on the second, third, fourth… run. So my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.

Is there any clean way to read the files into the file cache before the customer runs the software? Is there a better option than simply loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
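For reference, "loading the files with fread in advance" could look roughly like the sketch below. The file name and buffer size are placeholders, and whether the data actually stays in the cache afterwards is entirely up to the OS:

    // Sketch of warming the OS file cache: read the whole file once and
    // discard the data, so a later "real" read may be served from RAM.
    #include <cstdio>
    #include <vector>

    void warm_cache(const char* path)
    {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return;

        std::vector<char> buffer(1 << 20);   // 1 MiB scratch buffer
        while (std::fread(buffer.data(), 1, buffer.size(), f) == buffer.size()) {
            // data is thrown away; the point is only to pull it into the cache
        }
        std::fclose(f);
    }

    int main()
    {
        warm_cache("data.bin");              // "data.bin" is a placeholder name
    }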

Or is my file cache assumption totally wrong? Is there another (better) explanation for these different loading times?

Upvotes: 0

Views: 529

Answers (1)

fer-rum

Reputation: 211

Educated guess here: you are most likely right about your file cache assumption.

Can you preload files before the user runs the software? Not directly. How would your program know that it is going to be run in the next few minutes?

So you probably need a helper mechanism or tricks. The options I see here are:

  • Use an indexing mechanism to provide faster, better-targeted access to your data. This is helpful if you only need small chunks of information from this data at a time.
  • Parallelize the loading of the data: even if it does not actually get faster, the user has the impression it does, because they can already start working with the data that has arrived while the rest is fetched in the background (see the sketch after this list).
  • Have a helper tool start up with the OS and pre-fetch everything, so you already have it in memory when required. Caution: this has serious implications, since you reserve either a large chunk of RAM or even SSD cache (depending on the implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
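To illustrate the second option, here is a minimal sketch (C++17, file name and chunk size are placeholders) of a background loader: one thread fills the buffer while the main thread can already work on the part that has arrived.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <filesystem>
    #include <thread>
    #include <vector>

    int main()
    {
        const char* path = "data.bin";   // placeholder file name
        const auto file_size = static_cast<std::size_t>(std::filesystem::file_size(path));

        std::vector<char> data(file_size);
        std::atomic<std::size_t> bytes_ready{0};   // how many bytes are usable so far

        std::thread loader([&] {
            std::FILE* f = std::fopen(path, "rb");
            if (!f) return;
            const std::size_t chunk = 16u << 20;    // read 16 MiB at a time
            std::size_t offset = 0;
            while (offset < file_size) {
                std::size_t n = std::fread(data.data() + offset, 1,
                                           std::min(chunk, file_size - offset), f);
                if (n == 0) break;                  // EOF or read error
                offset += n;
                bytes_ready.store(offset, std::memory_order_release);
            }
            std::fclose(f);
        });

        // The main thread can already work on data[0 .. bytes_ready) here,
        // re-checking bytes_ready as the loader fills in the rest.

        loader.join();
    }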

You can also try to combine the first two options. The key to faster data availability is to figure out what to read in which order, instead of trying to load everything at once in one block. Divide and conquer.
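As a rough sketch of such index-driven access: read only the chunk you actually need instead of the whole 2 GB file. The offset and length below are hypothetical values that would normally come from an index built beforehand.

    #include <cstddef>
    #include <fstream>
    #include <vector>

    // Read `length` bytes starting at `offset` from a binary file.
    std::vector<char> read_chunk(const char* path, std::streamoff offset, std::size_t length)
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> chunk(length);
        in.seekg(offset);                                         // jump straight to the record
        in.read(chunk.data(), static_cast<std::streamsize>(length));
        chunk.resize(static_cast<std::size_t>(in.gcount()));      // keep only what was actually read
        return chunk;
    }

    int main()
    {
        // Hypothetical index entry: the record of interest starts at byte 1'500'000'000.
        std::vector<char> record = read_chunk("data.bin", 1'500'000'000LL, 4096);
    }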

Without further details on the problem, though, it is impossible to suggest more specific solutions.

Upvotes: 2
