mmap() on fd given by memfd_create() sometimes fails with Bad file descriptor

Question

I have two processes, a client and a server.

The server creates an anonymous file using the Linux memfd_create() syscall. It then mmap()s the fd, which works fine. It also prints the fd to stdout.

Now when I pass this fd to the client program, it also tries to mmap() it but somehow fails this time.

server.c:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

const size_t SIZE = 1024;

int main() {
    int fd = memfd_create("testmemfd", MFD_ALLOW_SEALING);
    // replacing the MFD_ALLOW_SEALING flag with 0 doesn't seem to change anything
    if (fd == -1) {
        perror("memfd_create");
    }
    if (ftruncate(fd, SIZE) == -1) {
        perror("ftruncate");
    }
    void * data = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
    }
    close(fd);
    // removing close(fd) or the mmap() code doesn't seem to change anything

    printf("%d\n", fd);
    while (1) {

    }
    return 0;
}

client.c:

#include <stdio.h>
#include <sys/mman.h>

const size_t SIZE = 1024;

int main() {
    int fd = -1;
    scanf("%d", &fd);
    printf("%d\n", fd);
    void * data = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
    }
    return 0;
}

(note that using the memfd_create() syscall needs _GNU_SOURCE to be defined when compiling)

Now I run them:

$ ./server
3

# in another terminal, since server process won't exit:
$ ./client
3
3
mmap: Bad file descriptor

$

Since the server process is still running, why is the fd invalid? Why did the fd work fine with mmap() in the server but not in another process?

I also tried the code here: a-darwish/memfd-examples, which uses sockets to pass data from the server to the client.

It works fine, but when I change the server to output fd to stdout and the client to read it from stdin instead of the whole socket business, mmap complains of a bad file descriptor again.

Why would mmap() work with an fd received over a socket but not with one read from stdin?

Then I changed the memfd-examples code to use sockets again, which made it work again. So I added a printf to the server and client to print the fd they were sending/receiving. The code worked fine, despite this strangeness:

$ ./memfd-examples/server
[Mon Jun  8 18:43:27 2020] New client connection!
sending fd = 5

# and in another terminal
$ ./memfd-examples/client
got fd = 4
Message: Secure zero-copy message from server: Mon Jun  8 18:43:27 2020

So the code is working fine, with what seems to be the wrong fd entirely?

I then tried decrementing the received fd in my client program -- doesn't work ("No such device", as one would expect).

So, what am I doing wrong with mmap()?

Guest · Accepted Answer

(Note that this answer is not directed only to OP, but to anyone who encounters a similar problem.)

The problem OP is seeing arises because a descriptor number created in one process is being used in a different process; the underlying issue is how to pass a file descriptor between processes.

A file descriptor is a number that a process uses to refer to a file description when dealing with files, sockets, FIFOs, or anything else file-like in a Unix or POSIX system.
A file description is the internal kernel data structure that refers to a file-like object; it includes the position (if seekable), record locks, and so on.
File descriptors are specific to a process. That is, descriptor 3 in one process has nothing to do with descriptor 3 in another process, unless they happen to refer to the same object.
Processes can share the same file description. Unix domain sockets can be used to pass a file descriptor from one end to the other, between processes. This is not just passing a number; it is a special technique using ancillary data that the OS kernel supports. Essentially, the OS kernel ensures that (usually different) file descriptors refer to the same file description, even when they are in different processes. This also means that the descriptor number gets modified by the kernel in flight, which is why the memfd-examples server can report sending fd 5 while the client reports receiving fd 4.
There are two commonly used types of Unix domain sockets: stream and datagram. Stream is very similar to a bidirectional pipe or a TCP stream: there are no messages or message boundaries, just a sequential stream of data. Datagrams are "packets", each with a specific length. (Please avoid zero-length datagrams.)
A Unix domain stream socket can almost always replace a pipe between a parent and a child process, because the two behave so similarly.
Only recvmsg() and the Linux extension recvmmsg() handle ancillary data. If a receiving process uses recv() or read() instead (i.e. is not prepared to receive ancillary data), the current Linux C library and kernel do not create the file descriptor on the receiving end. That is, a malicious end cannot spam any number of descriptors to an unsuspecting end; the receiving end can only receive descriptors if it is prepared to do so.
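
The difference between a descriptor and a description can be seen even inside a single process. The following is only a sketch: the function name compare_offsets and the scratch file under /tmp are made up for illustration. It writes through one descriptor and then asks two other descriptors for their file position: one obtained with dup() (same description) and one obtained by opening the file again (new description).

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes 5 bytes through one descriptor, then reports
 * the offset seen through a dup()ed descriptor and through a separately
 * opened one. Returns 0 on success, -1 on error. */
int compare_offsets(off_t *via_dup, off_t *via_reopen)
{
    char path[] = "/tmp/fd-demo-XXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);                /* first open file description */
    if (fd == -1)
        return -1;

    int copy = dup(fd);                    /* new number, same description */
    int reopened = open(path, O_RDONLY);   /* new number, new description */
    unlink(path);                          /* open descriptors keep the file alive */

    int status = -1;
    if (copy != -1 && reopened != -1 && write(fd, "hello", 5) == 5) {
        *via_dup = lseek(copy, 0, SEEK_CUR);        /* shares the position */
        *via_reopen = lseek(reopened, 0, SEEK_CUR); /* has its own position */
        status = 0;
    }

    if (copy != -1)
        close(copy);
    if (reopened != -1)
        close(reopened);
    close(fd);
    return status;
}
```

After a successful call, the dup()ed descriptor reports offset 5 and the reopened one reports offset 0. Descriptor passing over a Unix domain socket behaves like dup() across processes: the receiver gets its own number for the same description.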

In Linux, when procfs is available, each open file descriptor also has a name visible in the filesystem, /proc/PID/fd/FD, where PID is the process ID and FD is the file descriptor number in that process. (procfs and sysfs, usually mounted at /proc and /sys, respectively, are not physically stored on any media, but are dynamically generated by the OS kernel as they are accessed. Because of this, they are often called pseudofiles or pseudofilesystems: they behave like files in most ways, although their length is usually reported as zero, because the contents do not exist before you read them.)

However, the /proc/PID/ directories are typically only accessible to the user account that particular process is running as; and can be even further restricted if Linux Security Modules like SELinux are used. This is an important security feature, and any attempt to bypass this should be considered a serious potential risk – likely nefarious, in my opinion.
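
As a rough sketch of the procfs route, assuming the other process has told us its PID and descriptor number by some ordinary means (stdout, a pipe, a command-line argument), the receiving side could open the procfs name and map it. The function name map_fd_of_process is hypothetical, and the open() fails if the access restrictions described above apply.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: maps `size` bytes of the object that descriptor
 * `fd_number` refers to in process `pid`, via /proc/PID/fd/FD.
 * Returns the mapping, or MAP_FAILED on error. */
void *map_fd_of_process(pid_t pid, int fd_number, size_t size)
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%ld/fd/%d", (long)pid, fd_number);
    if (n < 0 || (size_t)n >= sizeof path)
        return MAP_FAILED;

    /* This creates a descriptor in OUR process; its number is unrelated
     * to fd_number in the other process. */
    int fd = open(path, O_RDWR);
    if (fd == -1)
        return MAP_FAILED;

    void *data = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return data;
}
```

For this to work, the process that owns the descriptor has to keep it open until the other side has opened the path; a server that calls close(fd) before printing the number, as in the question, leaves nothing behind in /proc/PID/fd/ to open.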

So, there are two possible approaches in Linux: pass the procfs path to the file descriptor (/proc/PID/fd/FD) and hope the other end can access it, or use a Unix domain socket (stream or datagram) between the two processes and pass the descriptor over that.

For details on ancillary data management, see man 2 sendmsg and man 3 cmsg.
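
A minimal sketch of descriptor passing with SCM_RIGHTS ancillary data might look like the following. The function names send_fd and recv_fd are illustrative, not part of any library, and the socket is assumed to be an already connected Unix domain socket (for example from socketpair() or accept()).

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: sends descriptor `fd` over Unix domain socket `sock`.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F';   /* Linux stream sockets need at least one byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {            /* union guarantees correct alignment for the header */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof ctrl);

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receives one descriptor from `sock`.
 * Returns the new descriptor (valid in the calling process), or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof fd);   /* number chosen by the kernel */
    return fd;
}
```

In OP's setup, the server would call send_fd() with the memfd on a connected socket and the client would pass the result of recv_fd() to mmap(). The number returned by recv_fd() generally differs from the number the sender used, and that is expected.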

I personally recommend the descriptor passing approach. Not only is it more robust, but it is also portable between Linux and many Unixy systems, for example the BSD variants (including macOS).

This is also the reason why many privileged services that use unprivileged or restricted child processes, like the Apache and Nginx HTTP daemons (for example in their FastCGI implementations), use Unix domain sockets for interprocess communication.

(Another reason is SCM_CREDENTIALS ancillary data, which consists of the process ID, user ID, and a group ID of the sending process, which the kernel verifies; that allows a receiver to verify the identity of the sender of a particular message, at the moment of sending. This wording may sound oddly complex, but since the sender process may have replaced itself with something new as soon as the message was received but not processed yet, we must be careful and understand the situation correctly to not leave gaping security holes in our software.)
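
As a Linux-specific sketch of the receiving side, assuming a connected Unix domain socket, the credentials can be requested with the SO_PASSCRED socket option and then read from the ancillary data. The function names enable_credentials and recv_with_credentials are made up for this example; struct ucred and SCM_CREDENTIALS require _GNU_SOURCE to be defined before any header is included.

```c
#define _GNU_SOURCE   /* must come first: struct ucred, SCM_CREDENTIALS */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: asks the kernel to attach the sender's credentials
 * to messages received on `sock`. Returns 0 on success, -1 on error. */
int enable_credentials(int sock)
{
    int on = 1;
    return setsockopt(sock, SOL_SOCKET, SO_PASSCRED, &on, sizeof on);
}

/* Hypothetical helper: receives up to `len` bytes plus the kernel-verified
 * credentials of the sender. Returns the byte count, or -1 on error or if
 * no credentials were attached. */
ssize_t recv_with_credentials(int sock, void *buf, size_t len, struct ucred *cred)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    union {            /* union guarantees correct alignment for the header */
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } ctrl;

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    ssize_t n = recvmsg(sock, &msg, 0);
    if (n == -1)
        return -1;

    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_CREDENTIALS &&
            cmsg->cmsg_len == CMSG_LEN(sizeof *cred)) {
            memcpy(cred, CMSG_DATA(cmsg), sizeof *cred);
            return n;
        }
    }
    return -1;
}
```

The pid, uid, and gid fields of the returned struct ucred describe the sender as it was when the message was sent, which is the caveat the paragraph above is pointing at.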

It is unfortunate that OP has already implemented interprocess communication using POSIX message queues (see man 7 mq_overview), but they do not support ancillary data or passing descriptors. A refactoring is in order.
