user816318

Reputation: 55

Sparse reading of a 50 GB file

I have a 50 GB file that I want to read.

There will be x instructions to read from the 50 GB file. They will arrive in sequential order, but the locations are unpredictable. Each instruction reads 1 byte.

The total reads account for a tiny fraction of the file, around 100 bytes in total (roughly 0.0000002% of 50 GB).

Right now I am using seekg with offsets to do this, but it is sometimes taking more than 3 seconds.

Would memory mapping the file speed up the reads in this case? Does it even make sense to memory-map the file if I don't have 50 GB of RAM?

Is there something else I can do to speed this up?

Here is some code that takes around 2 seconds to run for me (I adjusted it to do 300 reads to make it take longer):

#include <iostream>
#include <fstream>
#include <set>
#include <cstdlib>
#include <ctime>

using namespace std;

int main() {

    ifstream in("E:/t.dat", ifstream::binary);
    // Try to shrink the stream's internal buffer so each read fetches as
    // little extra data as possible (the effect is implementation-defined).
    in.rdbuf()->pubsetbuf(0, 1);

    // Generate 300 sorted, random byte offsets into the 50 GB file.
    // Note: RAND_MAX may be as small as 32767, which limits the spread
    // of these test offsets on some platforms.
    set<long long> S;
    srand(time(NULL));
    for(int i=0;i<300;i++){
        S.insert((long long)(rand()%50000000)*1000ll);
    }

    long long offset = 0;
    in.seekg(0,ios::beg);
    int sum = 0;
    for(set<long long>::iterator it = S.begin(); it!=S.end(); it++){
        // Seek forward in chunks under 2 GB to stay within the range of a
        // 32-bit seek offset on platforms where that is a limit.
        long long toseek = *it - offset;
        while(toseek > 2000000000){
            in.seekg(2000000000,ios::cur);
            toseek -= 2000000000;
            offset += 2000000000;
        }
        in.seekg(toseek,ios::cur);
        offset += toseek;
        char c;
        in.read(&c,1);
        offset++;
        sum += (int)c; // accumulate so the reads aren't optimized away
    }
    cout<<sum<<endl;
}

Upvotes: 1

Views: 177

Answers (2)

Adam Rosenfield

Reputation: 400454

Would memory mapping the file speed up the reads in this case?

That's hard to answer without knowing more details about the file access patterns and the OS. Your best bet would be to try it out and measure. For the non-memory-mapped case, I'd recommend disabling buffering with setvbuf(3) to avoid reading in any extra data (or alternatively, using the raw Unix file API of open(2)/lseek(2)/read(2)/close(2)). You can also use posix_fadvise(2) to give the OS hints about how to buffer the file pages -- in your case you probably want POSIX_FADV_RANDOM, which tells the kernel you'll be accessing the pages randomly and should disable readahead that would otherwise do unnecessary I/O.
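As a rough sketch of that raw-syscall route, assuming a POSIX system (your E:/ path suggests Windows, where the analogous calls would be CreateFile/SetFilePointerEx/ReadFile) -- the file name and offsets here are placeholders, error handling is pared down, and on a 32-bit build you'd want _FILE_OFFSET_BITS=64 so off_t covers the full 50 GB:

#include <fcntl.h>    // open, posix_fadvise
#include <unistd.h>   // lseek, read, close
#include <cstdio>
#include <iostream>
#include <vector>

int main() {
    int fd = open("t.dat", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    // Hint that access will be random so the kernel can skip readahead
    // (length 0 means "to the end of the file").
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    // Placeholder offsets; in practice they come from the read instructions.
    std::vector<off_t> offsets = {12345, 2000000000LL, 49999999999LL};

    int sum = 0;
    for (off_t off : offsets) {
        if (lseek(fd, off, SEEK_SET) == (off_t)-1) { std::perror("lseek"); break; }
        char c;
        if (read(fd, &c, 1) == 1)
            sum += c;   // one byte per instruction, as in the question
    }
    close(fd);
    std::cout << sum << std::endl;
}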

Does it even make sense to memory-map the file if I don't have 50 GB of RAM?

Sure it does, as long as you have enough address space -- this won't work at all for a 32-bit process, but it will be fine in a 64-bit process. The OS will allocate the virtual address space for the entire range of the file, but thanks to demand paging, it won't commit any physical memory to it until you actually read or write any given page. If you happen to touch more pages than can fit in physical memory at once, then the pager will just page out the least recently used pages.
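To make that concrete, here is a minimal sketch of the memory-mapped variant, again assuming a POSIX system and a 64-bit build (on Windows you'd use CreateFileMapping/MapViewOfFile instead); the path and offsets are placeholders:

#include <sys/mman.h>   // mmap, madvise, munmap
#include <sys/stat.h>   // fstat
#include <fcntl.h>      // open
#include <unistd.h>     // close
#include <cstdio>
#include <iostream>
#include <vector>

int main() {
    int fd = open("t.dat", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    // Map the whole file. This only reserves address space; physical pages
    // are faulted in on demand as individual bytes are touched.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Analogous to POSIX_FADV_RANDOM: tell the pager accesses are random.
    madvise(p, st.st_size, MADV_RANDOM);
    const char* data = static_cast<const char*>(p);

    // Placeholder offsets; in practice they come from the read instructions.
    std::vector<long long> offsets = {12345LL, 2000000000LL, 49999999999LL};
    int sum = 0;
    for (long long off : offsets)
        sum += data[off];   // each access faults in at most one page

    munmap(p, st.st_size);
    close(fd);
    std::cout << sum << std::endl;
}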

Upvotes: 4

user207421

Reputation: 310985

Right now I am using seekg with offsets to do this, but it is sometimes taking more than 3 seconds.

Post the code.

Would memory mapping the file speed up the reads in this case? Does it even make sense to memory-map the file if I don't have 50 GB of RAM?

Probably not, and it probably wouldn't perform any better anyway; it could perform far worse, as you risk causing paging on a heroic scale.

Is there something else I can do to speed this up?

Post the code.

Upvotes: 0
