Is writev() really atomic?

Here is what man writev says:

The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes (but see pipe(7) for an exception); analogously, readv() is guaranteed

This is from man 7 pipe:

   O_NONBLOCK disabled, n <= PIPE_BUF
          All n bytes are written atomically; write(2) may block if there is not room for n bytes to be written immediately

   O_NONBLOCK enabled, n <= PIPE_BUF
          If there is room to write n bytes to the pipe, then write(2) succeeds immediately, writing all n bytes; otherwise write(2) fails, with errno set to EAGAIN.

   O_NONBLOCK disabled, n > PIPE_BUF
          The write is nonatomic: the data given to write(2) may be interleaved with write(2)s by other process; the write(2) blocks until n bytes have been written.

   O_NONBLOCK enabled, n > PIPE_BUF
          If  the pipe is full, then write(2) fails, with errno set to EAGAIN.  Otherwise, from 1 to n bytes may be written (i.e., a "partial write" may occur; the caller should check the return value from write(2) to see how many bytes were actually written), and these bytes may be interleaved with writes by other processes.
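
The four cases quoted above depend on two things only: whether O_NONBLOCK is set on the descriptor, and whether n exceeds PIPE_BUF. A minimal sketch of that decision follows. The enum and the function name `classify_pipe_write` are hypothetical names introduced for illustration, and the helper assumes the descriptor really is a pipe or FIFO without checking.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* The four cases listed in pipe(7), plus an error value.
 * These names are illustrative, not part of any API. */
enum pipe_write_case {
    PW_ERROR = -1,
    PW_BLOCKING_ATOMIC,     /* O_NONBLOCK disabled, n <= PIPE_BUF */
    PW_NONBLOCKING_ATOMIC,  /* O_NONBLOCK enabled,  n <= PIPE_BUF */
    PW_BLOCKING_NONATOMIC,  /* O_NONBLOCK disabled, n >  PIPE_BUF */
    PW_NONBLOCKING_PARTIAL  /* O_NONBLOCK enabled,  n >  PIPE_BUF */
};

/* Hypothetical helper: report which pipe(7) case a write of n bytes
 * to fd falls under. Assumes fd refers to a pipe or FIFO. */
enum pipe_write_case classify_pipe_write(int fd, size_t n)
{
    int flags = fcntl(fd, F_GETFL);              /* file status flags */
    long pipe_buf = fpathconf(fd, _PC_PIPE_BUF); /* atomic write limit */

    if (flags == -1 || pipe_buf == -1)
        return PW_ERROR;

    int nonblock = (flags & O_NONBLOCK) != 0;
    int small = n <= (size_t)pipe_buf;

    if (small)
        return nonblock ? PW_NONBLOCKING_ATOMIC : PW_BLOCKING_ATOMIC;
    return nonblock ? PW_NONBLOCKING_PARTIAL : PW_BLOCKING_NONATOMIC;
}
```

For the 3-byte writes in the test program below, a blocking pipe would fall under the first case, which is why interleaving is unexpected if fd 1 is a pipe.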

$ cat writev.c
#include <string.h>
#include <sys/uio.h>

int
main(int argc,char **argv) {
    static char part1[] = "ST";
    static char part2[] = "\n";
    struct iovec iov[2];

    iov[0].iov_base = part1;
    iov[0].iov_len = strlen(part1);

    iov[1].iov_base = part2;
    iov[1].iov_len = strlen(part2);

    writev(1,iov,2);

    return 0;
}
$ gcc writev.c
$ unbuffer bash -c 'for ((i=0; i<50; i++)); do ./a.out & ./a.out; done' | wc -c
300  # < PIPE_BUF

# Run the following several times to get the output corrupted
$ unbuffer bash -c 'for ((i=0; i<50; i++)); do ./a.out & ./a.out; done' | sort | uniq -c
      4 
     92 ST
      4 STST

If writev() is atomic, as the documentation says, can anybody explain why the output of different writes is interleaved?

Update:

Some relevant data from strace -fo /tmp/log unbuffer bash -c 'for ((i=0; i<10000; i++)); do ./a.out & ./a.out; done' | sort | uniq -c

13301 writev(1, [{iov_base="ST", iov_len=2}, {iov_base="\n", iov_len=1}], 2 <unfinished ...>
13302 mprotect(0x56397d7d8000, 4096, PROT_READ) = 0
13302 mprotect(0x7f7190c68000, 4096, PROT_READ) = 0
13302 munmap(0x7f7190c51000, 90695)     = 0
13302 writev(1, [{iov_base="ST", iov_len=2}, {iov_base="\n", iov_len=1}], 2) = 3
13301 <... writev resumed> )            = 3
24814 <... select resumed> )            = 1 (in [4])
13302 exit_group(0 <unfinished ...>
13301 exit_group(0 <unfinished ...>
13302 <... exit_group resumed>)         = ?
13301 <... exit_group resumed>)         = ?
24814 futex(0x55b5b8c11cc4, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
24807 <... futex resumed> )             = 0
24814 <... futex resumed> )             = 1
24807 futex(0x7f7f55e8f920, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
13302 +++ exited with 0 +++
24807 <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
13301 +++ exited with 0 +++
24807 futex(0x7f7f55e8f920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
24814 futex(0x7f7f55e8f920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
24807 <... futex resumed> )             = 0
24814 <... futex resumed> )             = 0
24807 read(4,  <unfinished ...>
24814 select(6, [5], [], [], NULL <unfinished ...>
24807 <... read resumed> "STST\n\n", 4096) = 6
24808 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 13302
24807 write(1, "STST\n\n", 6 <unfinished ...>

Upvotes: 3

Views: 1188

Answers (1)

R.. GitHub STOP HELPING ICE

Reputation: 215259

As specified by POSIX, yes for pipes, as long as the total iov length does not exceed PIPE_BUF, because:

The writev() function shall be equivalent to write(), except as described below

with no exceptions made for pipes (the word pipe does not even appear in the writev specification).

In practice on Linux, maybe not. The equivalence of writev to a single write only holds for kernel file types that implement the "new" (as of 15 years ago or so) iov-based read/write backends. Some, like terminals, implement only the old interfaces that take a single buffer, and Linux emulates writev (or readv) on them as multiple write (respectively read) calls. The readv case is also problematic, as you can see in this commit to musl libc.
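
If the concern is that writev may be split into several write calls, one way to take that out of the picture is to join the segments in user space and issue a single write(2). The sketch below does that. `write_joined` is a hypothetical helper name, it does not retry partial writes, and it does not guard against the summed length overflowing size_t.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: copy every iovec segment into one contiguous
 * buffer and pass it to the kernel with a single write(2) call.
 * Returns what write(2) returns, or -1 with errno set on failure. */
ssize_t write_joined(int fd, const struct iovec *iov, int iovcnt)
{
    size_t total = 0;
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;

    if (total == 0)
        return 0;           /* nothing to write */

    char *buf = malloc(total);
    if (buf == NULL)
        return -1;          /* malloc sets errno to ENOMEM */

    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len > 0) {
            memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
            off += iov[i].iov_len;
        }
    }

    ssize_t n = write(fd, buf, total);  /* one syscall for all segments */
    int saved = errno;                  /* free() may clobber errno */
    free(buf);
    errno = saved;
    return n;
}
```

In the question's program this would replace `writev(1,iov,2)` with `write_joined(1, iov, 2)`. It only removes the writev emulation as a variable; whether a single write is atomic still depends on the type of file behind the descriptor.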

I'm not sure whether pipes are affected by this issue or not. You'd have to dig into the kernel sources.

Upvotes: 2
