Reputation: 7251
The idea is relatively simple, but I see some complications for implementations, so I'm wondering if it's even possible right now.
Of course the complication I mentioned is that the buffer should be deallocated and unmapped (since the data is now under the responsibility of the system cache), and I don't know how to do that either.
The important aspects are that:
One assumption is that:
So, (1) are these requirements too constrained? (2) Is this assumption correct?
PS: I had to arbitrarily pick the tags for this question, but I'm also interested in hearing how BSDs and Windows would do this. Of course if the POSIX API allows to do this already, that would be great.
Update: I call a buffer a space of private memory (private to the process/task in any OS with normal VMM) allocated in primary memory. The high-level goal involves generating a dataset of an arbitrary size using another input (in my case the network), then once it's generated, make it accessible for long periods of time (to the network and to the process itself), saving it to disk in the process.
The alternative that I see is to write or use a full-blown userland cache reading and writing to the disk itself to ensure that (a) pages don't get uselessly swapped out and (b) the process doesn't hold too much memory for itself, which is never possible to do optimally anyway (better let the kernel do its job), and which is simply not a road I think is worth going down (too complex for less gains).
Update: Requirements 2 and 3 are non-issues considering Nominal Animal's answer. Of course this implies that the assumption is incorrect, as he proved is almost the case (overhead is minimal). I also relaxed requirement 1, O_TMPFILE
is indeed perfect for this.
Update: A recent article on LWN mentions, somewhere in the middle: "That could possibly be done with a special write operation that would not actually cause I/O, or with a system call that would transfer a physical page into the page cache". That suggests that indeed, there is currently (April 2014) no way to do this at least with Linux (and likely other operating systems), much less with a standard API. The article is about PostgreSQL, but the issue in question is identical, except perhaps for the specific requirements to this question, which aren't defined in the article.
Upvotes: 3
Views: 1735
Reputation: 39298
This is not a satisfactory answer to the question; is is more of a continuation of the comment chain.
Here is a test program one can use to measure the overhead of using a file-backed memory map, instead of an anonymous memory map.
Note that the work()
function listed just fills in the memory map with random data. To be more realistic, it should simulate at least the access patterns expected from real-world usage.
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
/* Xorshift random number generator.
*/
static uint32_t xorshift_state[4] = {
123456789U,
362436069U,
521288629U,
88675123U
};
static int xorshift_setseed(const void *const data, const size_t len)
{
uint32_t state[4] = { 0 };
if (len < 1)
return ENOENT;
else
if (len < sizeof state)
memcpy(state, data, len);
else
memcpy(state, data, sizeof state);
if (state[0] || state[1] || state[2] || state[3]) {
xorshift_state[0] = state[0];
xorshift_state[1] = state[1];
xorshift_state[2] = state[2];
xorshift_state[3] = state[3];
return 0;
}
return EINVAL;
}
static uint32_t xorshift_u32(void)
{
const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
xorshift_state[0] = xorshift_state[1];
xorshift_state[1] = xorshift_state[2];
xorshift_state[2] = xorshift_state[3];
return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
}
/* Wallclock timing functions.
*/
static struct timespec wallclock_started;
static void wallclock_start(void)
{
clock_gettime(CLOCK_REALTIME, &wallclock_started);
}
static double wallclock_stop(void)
{
struct timespec wallclock_stopped;
clock_gettime(CLOCK_REALTIME, &wallclock_stopped);
return difftime(wallclock_stopped.tv_sec, wallclock_started.tv_sec)
+ (double)(wallclock_stopped.tv_nsec - wallclock_started.tv_nsec) / 1000000000.0;
}
/* Accessor function. This needs to read/modify/write the mapping,
* simulating the actual work done onto the mapping.
*/
static void work(void *const area, size_t const length)
{
uint32_t *const data = (uint32_t *)area;
size_t size = length / sizeof data[0];
size_t i;
/* Add xorshift data. */
for (i = 0; i < size; i++)
data[i] += xorshift_u32();
}
int main(int argc, char *argv[])
{
long page, size, delta, maxsize, steps;
int fd, result;
void *map, *old;
char dummy;
double seconds;
page = sysconf(_SC_PAGESIZE);
if (argc < 5 || argc > 6 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s MAPFILE SIZE DELTA MAXSIZE [ SEEDSTRING ]\n", argv[0]);
fprintf(stderr, "Where:\n");
fprintf(stderr, " MAPFILE backing file, '-' for none\n");
fprintf(stderr, " SIZE initial map size\n");
fprintf(stderr, " DELTA map size change\n");
fprintf(stderr, " MAXSIZE final size of the map\n");
fprintf(stderr, " SEEDSTRING seeds the Xorshift PRNG\n");
fprintf(stderr, "Note: sizes must be page aligned, each page being %ld bytes.\n", (long)page);
fprintf(stderr, "\n");
return 1;
}
if (argc >= 6) {
if (xorshift_setseed(argv[5], strlen(argv[5]))) {
fprintf(stderr, "%s: Invalid seed string for the Xorshift generator.\n", argv[5]);
return 1;
} else {
fprintf(stderr, "Xorshift initialized with { %lu, %lu, %lu, %lu }.\n",
(unsigned long)xorshift_state[0],
(unsigned long)xorshift_state[1],
(unsigned long)xorshift_state[2],
(unsigned long)xorshift_state[3]);
fflush(stderr);
}
}
if (sscanf(argv[2], " %ld %c", &size, &dummy) != 1) {
fprintf(stderr, "%s: Invalid map size.\n", argv[2]);
return 1;
} else
if (size < page || size % page) {
fprintf(stderr, "%s: Map size must be a multiple of page size (%ld).\n", argv[2], page);
return 1;
}
if (sscanf(argv[3], " %ld %c", &delta, &dummy) != 1) {
fprintf(stderr, "%s: Invalid map size change.\n", argv[2]);
return 1;
} else
if (delta % page) {
fprintf(stderr, "%s: Map size change must be a multiple of page size (%ld).\n", argv[3], page);
return 1;
}
if (delta) {
if (sscanf(argv[4], " %ld %c", &maxsize, &dummy) != 1) {
fprintf(stderr, "%s: Invalid final map size.\n", argv[3]);
return 1;
} else
if (maxsize < page || maxsize % page) {
fprintf(stderr, "%s: Final map size must be a multiple of page size (%ld).\n", argv[4], page);
return 1;
}
steps = (maxsize - size) / delta;
if (steps < 0L)
steps = -steps;
} else {
maxsize = size;
steps = 0L;
}
/* Time measurement includes the file open etc. overheads.
*/
wallclock_start();
if (strlen(argv[1]) < 1 || !strcmp(argv[1], "-"))
fd = -1;
else {
do {
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0600);
} while (fd == -1 && errno == EINTR);
if (fd == -1) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
return 1;
}
do {
result = ftruncate(fd, (off_t)size);
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
return 1;
}
result = posix_fadvise(fd, 0, size, POSIX_FADV_RANDOM);
}
/* Initial mapping. */
if (fd == -1)
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
else
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "Memory map failed: %s.\n", strerror(errno));
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
return 1;
}
result = posix_madvise(map, size, POSIX_MADV_RANDOM);
work(map, size);
while (steps-->0L) {
if (fd != -1) {
do {
result = ftruncate(fd, (off_t)(size + delta));
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "%s: Cannot grow file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
return 1;
}
result = posix_fadvise(fd, 0, size, POSIX_FADV_RANDOM);
}
old = map;
map = mremap(map, size, size + delta, MREMAP_MAYMOVE);
if (map == MAP_FAILED) {
fprintf(stderr, "Cannot remap memory map: %s.\n", strerror(errno));
munmap(old, size);
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
return 1;
}
size += delta;
result = posix_madvise(map, size, POSIX_MADV_RANDOM);
work(map, size);
}
/* Timing does not include file renaming.
*/
seconds = wallclock_stop();
munmap(map, size);
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
printf("%.9f seconds elapsed.\n", seconds);
return 0;
}
If you save the above as bench.c
, you can compile it using
gcc -W -Wall -O3 bench.c -lrt -o bench
Run it without parameters to see the usage.
On my machine, on ext4 filesystem, running tests
./bench - 4096 4096 4096000
./bench testfile 4096 4096 4096000
yields 1.307 seconds wall clock time for the anonymous memory map, and 1.343 seconds for the file-backed memory map, meaning the file backed mapping is about 2.75% slower.
This test starts with one page memory map, then enlarges it by one page a thousand times. For tests like 4096000 4096 8192000
the difference is even smaller. The time measured does include constructing the initial file (and using posix_fallocate()
to allocate the blocks on disk for the file).
Running the test on tmpfs, on ext4 over swRAID0, and on ext4 over swRAID1, on the same machine, does not seem to affect the results; all differences are lost in the noise.
While I would prefer to test this on multiple machines and kernel versions before making any sweeping statements, I do know something about how the kernel manages these memory maps. Therefore, I shall make the following claim, based on above and my own experience:
Using a file-backed memory map will not cause a significant slowdown compared to an anonymous memory map, or even compared to malloc()
/realloc()
/free()
. I expect the difference to be under 5% in all real-world use cases, and at most 1% for typical real-world use cases; less, if the resizes are rare compared to how often the map is accessed.
To user2266481 the above means it should be acceptable to just create a temporary file on the target filesystem, to hold the memory map. (Note that it is possible to create the temporary file without allowing anyone access to it, mode 0, as access mode is only checked when opening the file.) When the contents are in final form, ftruncate()
and msync()
the contents, then hard-link the final file to the temporary file using link()
. Finally, unlink the temporary file and close the temporary file descriptor, and the task should be completed with near-optimal efficiency.
Upvotes: 4