Yihong Xiang

Reputation: 75

Copy the whole file from disk into memory for processing, or read from the file each time until it is fully read?

I am working on something where efficiency matters a great deal. There are thousands of files, and each file is as large as 300 MB. Each file contains at least 500 thousand items, and my job is to process each item as quickly as possible. Physical memory size is not an issue. So: will I benefit from copying the whole file into memory and fetching each item from memory, instead of fetching each item from disk? And are there any other methods that can save time in the IO process? Thank you!

Upvotes: 1

Views: 1898

Answers (2)

You could use the mmap(2), madvise(2), posix_fadvise(2), and readahead(2) syscalls (note that readahead is Linux-specific and blocking, so you might want to call it in advance, or from a separate thread).
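A minimal sketch of the mmap route, assuming Linux (the helper name is made up and error handling is abbreviated):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Map a whole file read-only and hint that we will scan it sequentially.
// Returns the start of the mapping and stores its length in *len,
// or nullptr on failure.
const char* map_file(const char* path, std::size_t* len)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return nullptr;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return nullptr; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return nullptr;
    madvise(p, st.st_size, MADV_SEQUENTIAL);  // hint: front-to-back reads
    *len = st.st_size;
    return static_cast<const char*>(p);
}

Unmap with munmap when the file is processed.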

You might perhaps not need to care that much: just reading each 300 MB file in advance, a few seconds before processing it, might be enough. The kernel's file system and disk caches do a lot of work; with a lot of RAM the data may already be in memory.
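For example, a hedged sketch of such a prefetch using posix_fadvise (the helper name is made up; on Linux, POSIX_FADV_WILLNEED starts asynchronous readahead):

#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start pulling a file into the page cache.
// The call returns quickly; the IO proceeds in the background.
void prefetch(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return;
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);  // offset 0, length 0 = whole file
    close(fd);
}

Calling prefetch on the next file while the current one is being processed overlaps IO with computation.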

And you didn't tell us whether your program is a single long-lasting process, or whether you drive it through some repetitive script that invokes the same program on each big file.

System configuration and hardware matter a lot. You could configure the file system (at mke2fs time) with large blocks (e.g. 16 KB or 64 KB). If you can afford them, SSDs would bring a lot.

You could also design your application to carefully use some cleverly set-up database.

Upvotes: 4

pmr

Reputation: 59841

For starters:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

std::vector<char> input;
std::ifstream file("filename.txt", std::ios::binary);
// maybe find the file size first and reserve() on input
// istreambuf_iterator reads raw bytes and does not skip whitespace
std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(),
          std::back_inserter(input));

If this actually ends up not being fast enough for you, memory-mapped files usually remove a lot of the IO overhead.

The Boost.Iostreams library provides portable memory-mapped files with a modern interface and is really fast.
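A minimal sketch with its mapped_file_source device (the file name is a placeholder):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>

int main()
{
    // Map the whole file into memory; no read() copies are made.
    boost::iostreams::mapped_file_source src("filename.txt");
    const char* data = src.data();    // first byte of the file
    std::size_t size = src.size();    // total length in bytes
    // ... scan the items in [data, data + size) ...
}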

Anyway: try the easy solution first, structure your program so that file IO is separated from parsing and from the actual processing, then optimize the parts that are actually expensive. Such a structure also makes producer/consumer parallelism easy to implement, as in the sketch below.
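As one sketch of that structure, here is a minimal hand-off queue between a reader thread and a processing thread (the names and the item granularity are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// One reader thread push()es raw items, one worker thread pop()s them.
struct ItemQueue {
    std::queue<std::string> items;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::string s) {
        { std::lock_guard<std::mutex> lk(m); items.push(std::move(s)); }
        cv.notify_one();
    }
    // Returns false once the producer has finished and the queue is drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !items.empty() || done; });
        if (items.empty()) return false;
        out = std::move(items.front());
        items.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

The reader calls push for each item and finish at end-of-input; the worker loops on pop. An unbounded queue is only acceptable here because you said memory is not a concern; otherwise cap its size.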

Another important point is what your items are: can they be mapped directly into a struct, or do they have to be parsed first? If the latter, how complicated is the actual parsing?
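If they are fixed-size binary records, a hedged sketch of viewing the mapped bytes in place (the record layout is purely hypothetical, and this assumes matching endianness and alignment):

#include <cstddef>
#include <cstdint>

// Hypothetical on-disk record; must match the file format exactly.
struct Item {
    std::uint32_t id;
    std::uint32_t value;
};

// Reinterpret a mapped region as an array of records, without copying.
const Item* view_items(const char* data, std::size_t size, std::size_t* count)
{
    *count = size / sizeof(Item);
    return reinterpret_cast<const Item*>(data);
}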

Upvotes: 2
