user12986714

Reputation: 741

Prefetch with file-backed mmap in C

I am writing some performance-critical code (i.e. in a very tight loop, as indicated by profiling) whose logic is basically the following (the_key is a parameter and mmap_base is the base address of a memory-mapped file):

while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key) {
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}

Profiling indicates that this piece of code is disk-bound on the dereference of (mmap_base + current_item), which makes sense, as random disk IO is considerably slower than sequential IO.

It is impossible to keep the relevant portion of the mapping resident in memory, as the file is huge, at about 100 GB. I am thinking about using something like __builtin_prefetch():

while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key) {
    __builtin_prefetch(mmap_base + ((struct my_struct *)(mmap_base + current_item))->next, 0, 0);
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}

However, this does not work. It looks like __builtin_prefetch() is of no use here: the CPU prefetch instruction it emits cannot trigger a page fault, so it cannot pull non-resident mmap-ed pages in from disk.
I then tried madvise():

while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key) {
    /* madvise() requires a page-aligned address; page_size obtained
       once via sysconf(_SC_PAGESIZE) */
    char *next_addr = mmap_base + ((struct my_struct *)(mmap_base + current_item))->next;
    madvise(next_addr - ((uintptr_t)next_addr % page_size), sizeof(struct my_struct), MADV_WILLNEED);
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}

However, this actually decreased performance, and profiling showed that the madvise() call itself became the major overhead.

Are there compiler builtins (x86_64, GCC) or other ways to tell the kernel (Linux) to prefetch data from disk into memory/the CPU cache?

Edit 1:
Some suggested that this is simply impossible without improving data locality. In that case, however, I wonder why it is impossible to issue an asynchronous read to the disk while moving on to the "less performance critical" part, which should make the later access faster. Is this a matter of the kernel not implementing it, or a theoretical/physical restriction?
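To illustrate what I mean by an asynchronous read: something like posix_fadvise() with POSIX_FADV_WILLNEED, which initiates a non-blocking readahead into the page cache. This is only a sketch (hint_next is a made-up helper, fd is the descriptor the file was mapped from, and I would expect the same per-call syscall cost that hurt madvise()):

#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to start reading the page holding the next record,
   without blocking the calling thread. */
static void hint_next(int fd, off_t next_item)
{
    long page = sysconf(_SC_PAGESIZE);   /* cache this outside the loop */
    off_t aligned = next_item & ~((off_t)page - 1);
    posix_fadvise(fd, aligned, page, POSIX_FADV_WILLNEED);
}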

Edit 2:
Some recommended using a separate thread to pre-access the memory, so as to let the kernel prefetch it. However, threads can be expensive. Is it really helpful to start a thread for each prefetch? The code is in a tight loop, so that could mean a lot of threads being started and joined. On the other hand, if I use only one thread, how should I communicate with it about what to prefetch? See the sketch below for what I imagine.
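To make the one-thread variant concrete: a long-lived worker plus a small single-producer/single-consumer ring of offsets. This is a rough sketch with made-up names, using C11 atomics; the busy-wait would need a sleep or futex in real code:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 64                     /* power of two */

static _Atomic size_t ring[RING_SIZE];
static _Atomic unsigned head, tail;
static char *mmap_base;                  /* set after mmap() */

/* Start once with: pthread_create(&tid, NULL, prefetch_worker, NULL); */
static void *prefetch_worker(void *arg)
{
    (void)arg;
    for (;;) {
        unsigned t = atomic_load(&tail);
        if (t == atomic_load(&head))
            continue;                    /* ring empty: spin */
        size_t off = atomic_load(&ring[t % RING_SIZE]);
        volatile char sink = mmap_base[off];  /* touch page to fault it in */
        (void)sink;
        atomic_store(&tail, t + 1);
    }
    return NULL;
}

/* Called from the tight loop; silently drops the hint when the ring is full. */
static void hint_offset(size_t off)
{
    unsigned h = atomic_load(&head);
    if (h - atomic_load(&tail) == RING_SIZE)
        return;
    atomic_store(&ring[h % RING_SIZE], off);
    atomic_store(&head, h + 1);
}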

Upvotes: 0

Views: 972

Answers (1)

Gonbidatu

Reputation: 66

This type of access pattern will always be slow, because it potentially jumps all over the file, with no sensible way to predict the pattern.

The approach I would try is to generate a separate memory-mapped key index file, containing just the key values and the offsets of the corresponding records, with keys sorted in increasing order. That way, finding a specific key takes roughly O(log N) time (depending on how you deal with duplicate keys), using a very simple binary search.
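A minimal sketch of such a lookup, assuming the index is an mmap-ed array of fixed-size entries (struct idx_entry and the 64-bit types are illustrative assumptions):

#include <stddef.h>
#include <stdint.h>

struct idx_entry {
    uint64_t key;
    uint64_t offset;                 /* record offset in the data file */
};

/* Returns the offset of the first entry with key >= the_key,
   or (uint64_t)-1 if every key is smaller. */
static uint64_t idx_lookup(const struct idx_entry *idx, size_t n, uint64_t the_key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].key < the_key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n ? idx[lo].offset : (uint64_t)-1;
}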

If the keys in the 100 GB file are being modified during operation, a single flat file is unsuitable for describing the data.

If you can handle the code complexity, partitioned binary search trees in array form have even better performance. In that case, you split the index file into fixed-size partitions, say 64 kB (4096 key-offset pairs), each containing, in array form, part of the perfectly balanced binary search tree. For example, the very first partition contains the middle key, the 1/4 and 3/4 keys, the 1/8, 3/8, 5/8, and 7/8 keys, and so on. Furthermore, you include only the keys in the primary index file, and use a secondary index file for the record offsets. (If you have duplicate keys, have the secondary index file refer to the first one, with each duplicate secondary index entry referring to the next one, so you can follow the chain directly with a small time penalty but no extra space cost.)

This has much better locality than a binary search on a sorted array, but the code and logic complexity is a bit daunting.
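For reference, a sketch of the search over that implicit layout (BFS order: the root key at index 0, the children of node i at 2*i+1 and 2*i+2, so the top levels of the tree share a few hot pages; names are illustrative):

#include <stddef.h>
#include <stdint.h>

/* Returns the position (in BFS order) of the smallest key >= the_key,
   or n if there is none; this position indexes the secondary offset file. */
static size_t bfs_lower_bound(const uint64_t *keys, size_t n, uint64_t the_key)
{
    size_t i = 0, best = n;
    while (i < n) {
        if (keys[i] < the_key) {
            i = 2 * i + 2;           /* descend right */
        } else {
            best = i;                /* candidate; descend left */
            i = 2 * i + 1;
        }
    }
    return best;
}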

Upvotes: 1
