Reputation: 133
I'm writing a small program for Wayland that uses software rendering and wl_shm for display. This requires that I pass a file descriptor for my screen buffer to the Wayland server, which then calls mmap() on it, i.e. the screen buffer must be shareable between processes.
In this program, startup latency is key. Currently, there is only one remaining bottleneck: the initial draw to the screen buffer, where the entire buffer is painted over. The code below shows a simplified version of this:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main()
{
    /* Fullscreen buffers are around 10-30 MiB for common resolutions. */
    const size_t size = 2880 * 1800 * 4;
    int fd = memfd_create("shm", 0);
    ftruncate(fd, size);
    void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* Ideally, we could just malloc, but this memory needs to be shared. */
    //void *pool = malloc(size);
    /* In reality this is a cairo_paint() call. */
    memset(pool, 0xCF, size);
    /* Subsequent paints (or memsets) after the first take negligible time. */
}
On my laptop, the memset() above takes around 21-28 ms. Switching to malloc()'ed memory drops this to 12 ms, but the problem is that the memory needs to be shared between processes. The behaviour is similar on my desktop: 7 ms for mmap(), 3 ms for malloc().
My question is: Is there something I'm missing that can improve the performance of shared memory on Linux? I've tried madvise() with MADV_WILLNEED and MADV_SEQUENTIAL, and using mlock(), but none of those made a difference. I've also thought about whether 2 MiB huge pages would help given the buffer sizes of around 10-30 MiB, but they're not usually available.
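For context, those hints were applied to the mapping from the example above, roughly like this (a sketch of the approach, not my exact code):

/* Hints applied right after the mmap(); none of them changed the first-paint time. */
madvise(pool, size, MADV_WILLNEED);   /* hint that the whole range will be needed soon */
madvise(pool, size, MADV_SEQUENTIAL); /* hint at a sequential access pattern */
mlock(pool, size);                    /* lock the pages into RAM */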
Edit: I've tried mmap() with MAP_ANONYMOUS | MAP_SHARED, which is just as slow as before. MAP_ANONYMOUS | MAP_PRIVATE results in the same speed as malloc(), but that defeats the purpose.
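Concretely, the two variants I compared look roughly like this (a sketch, reusing size from the example above):

/* Anonymous shared mapping: the first memset() is just as slow as with the memfd. */
void *anon_shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
/* Anonymous private mapping: as fast as malloc(), but there's no fd to pass to
 * the compositor, so it can't back a wl_shm pool. */
void *anon_private = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);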
Upvotes: 6
Views: 1107
Reputation: 133
The difference in performance between malloc() and mmap() seems to be due to the differing application of Transparent Hugepages.
By default on x86_64, the page size is 4 KiB and the huge page size is 2 MiB. Transparent Hugepages allows programs that don't know about hugepages to still use them, reducing page-fault overhead. However, it is only enabled by default for private, anonymous memory, which covers both malloc() and mmap() with MAP_ANONYMOUS | MAP_PRIVATE set, explaining why those two perform identically. For shared memory mappings it is disabled, resulting in more page-handling overhead (for the 10-30 MiB buffers I need) and causing the slowdown.
Hugepages can be enabled for shared memory mappings, as explained in the kernel docs page, via the /sys/kernel/mm/transparent_hugepage/shmem_enabled knob. This defaults to never, but setting it to always (or to advise, and adding the corresponding madvise(..., MADV_HUGEPAGE) call) allows memory mapped with MAP_SHARED to use hugepages, and the performance then matches malloc()'ed memory.
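In the example from my question, that amounts to one extra call between the mmap() and the first memset() (a minimal sketch; it only has an effect when shmem_enabled is advise or always, and needs a kernel built with CONFIG_TRANSPARENT_HUGEPAGE):

void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* Request hugepages for the shared mapping before the first write touches it. */
if (madvise(pool, size, MADV_HUGEPAGE) == -1)
    perror("madvise");
memset(pool, 0xCF, size);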
I'm unsure why the default is never for shared memory. While not very satisfactory, for now it seems the only solution is to use madvise(MADV_HUGEPAGE) to improve performance on any systems which happen to have shmem_enabled set to at least advise (or if it's enabled by default in future).
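If it's useful, here's a rough way to check that knob at runtime before deciding whether the madvise() call can help (a sketch; the active policy is the word shown in square brackets in that file):

#include <stdio.h>
#include <string.h>

/* Returns 1 if shmem THP can be requested via MADV_HUGEPAGE (policy is always or advise). */
static int shmem_thp_usable(void)
{
    char buf[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");
    if (!f || !fgets(buf, sizeof(buf), f)) {
        if (f)
            fclose(f);
        return 0;
    }
    fclose(f);
    /* File looks like: "always within_size advise [never] deny force" */
    return strstr(buf, "[always]") != NULL || strstr(buf, "[advise]") != NULL;
}

int main(void)
{
    printf("shmem hugepages usable: %s\n", shmem_thp_usable() ? "yes" : "no");
    return 0;
}

The check is optional; it just makes it visible whether the shmem_enabled knob is what's limiting things on a given system.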
Upvotes: 5