Reputation: 10220
On Linux 64-bit (such as Amazon EC2 instance), I need to load a couple large binary files into memory. What is the fastest way?
Also, the node may or may not launch this executable a second time, so it would help if the file loads even faster on subsequent attempts. Some sort of pre-loading step might even work.
Upvotes: 8
Views: 3760
Reputation: 14205
You may try mmap with the MAP_POPULATE flag. I doubt you can do this any faster.
Upvotes: 0
Reputation: 129314
Given the information above, I'd say mmap is a good candidate. There are a few reasons I say that:
1. It gives you the WHOLE file without actually loading (any of) the file until that part is actually needed. This is an advantage for fast loading, but if you will eventually have gone through every byte [or touched every 4KB section of the file], then there's no great difference.
2. The mmap will only copy the data ONCE from the disk to your pages. In my testing this is more efficient than reading using fread or read in Linux (note also that the difference between fread and read for reasonably large reads can be ignored; there is very little extra overhead in the FILE functions in C). C++ streams do add a fair bit of overhead, however, in my experience [I have tried various forms of this several times by now].
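For comparison, a plain read() baseline with large blocks might look like this (a sketch only: `checksum_read` is an illustrative name, and the checksum just stands in for "touch every byte" - here each byte is copied twice, disk to page cache and then page cache to the user buffer, which is the extra copy mmap avoids):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a whole file with large read() calls and checksum its bytes.
   Returns the checksum, or -1 on error. */
static long checksum_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    enum { CHUNK = 1 << 20 }; /* 1 MB reads: large enough that the
                                 per-syscall overhead is negligible */
    unsigned char *buf = malloc(CHUNK);
    if (!buf) {
        close(fd);
        return -1;
    }

    long sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];

    free(buf);
    close(fd);
    return n < 0 ? -1 : sum;
}
```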
As always, benchmarking trumps asking on the internet, so you MAY find that what I've said above is not right in your circumstances. And as pointed out, once the code is sufficiently good, any overhead in the code is dwarfed by the speed at which the disks can deliver data - even if you have a very fancy RAID system with lots of parallel (SSD?) disks, etc, eventually the disk transfer speed is going to be the bottleneck. All you can do at that point is keep the other overhead as small as possible and get the data to the application as quickly as possible after the disk has delivered it.
A good benchmark for "bytes per second" is dd if=/dev/zero of=somefile bs=4K count=1M (that writes a file; you may then want to dd if=somefile of=/dev/null bs=4K to see how well you can read from the disk).
Upvotes: 0
Reputation: 11582
The time is going to be dominated by disk I/O, so which API you use is not as important as thinking about how a disk works. If you access a disk (rotating media) randomly, it will cost 3 to 9 milliseconds to seek; once the disk is streaming it can sustain about 128 MB/sec - that is how fast bits will be coming off the disk head. The SATA link or PCIe bus has much higher bandwidth than that (600 to 2000 MB/sec).

Linux has a page cache in memory where it keeps a copy of pages from the disk, so provided your machine has adequate amounts of RAM, subsequent attempts will be fast, even if you then access the data randomly.

So the advice is: read large blocks at a time. If you really want to speed up the initial load, then you could use mmap to map the entire file (1GB-4GB) and have a helper thread that reads the 1st byte of each page in order.
You can read more about disk drive performance characteristics here.
You can read more about the page cache here.
Upvotes: 6