Race condition when piping through x86-64 assembly program

Question

I wrote the following simplified implementation of cat in assembly. It uses linux syscalls because I am running linux. Here's the code:

.section .data
.set MAX_READ_BYTES, 0xffff

.section .text
.globl _start

_start:
    movq (%rsp), %r10 # save the value of argc somewhere else
    movq 16(%rsp), %r9 # save the value of argv[1] somewhere else

    movl $12, %eax # syscall 12 is brk. see brk(2)
    xorq %rdi, %rdi # call with 0 as first arg to get current end of memory
    syscall
    movq %rax, %r8 # this is the address of the current end of memory

    leaq MAX_READ_BYTES(%rax), %rdi # let this be the new end of memory
    movl $12, %eax # syscall 12, brk
    syscall
    cmp %r8, %rax # compare the two; if the allocation failed, these will be equal
    je exit

    leaq -MAX_READ_BYTES(%rax), %r13 # store the start of the free area in %r13

    movq %r10, %rdi # retrieve the value of argc
    cmpq $0x01, %rdi # if there are no cli args, process stdin instead
    je stdin

    # open the file
    movl $0x02, %eax # syscall #2 = open.
    movq %r9, %rdi
    movl $0, %esi # second argument: flags. 0 means read-only.
    xorq %rdx, %rdx # this argument isn't used here, but zero it out for peace of mind.
    syscall # returns the file descriptor number in %rax
    movl %eax, %edi
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

stdin:
    movl $0x0000, %edi # first argument: file descriptor.
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

read_and_write:
    # read the file.
    movl $0, %eax # syscall #0 = read.
    movl %r12d, %edi
    movq %r13 /* pointer to allocated memory */, %rsi # second argument: address of a writeable buffer.
    movl $MAX_READ_BYTES, %edx # third argument: number of bytes to write.
    syscall # num bytes read in %rax
    movl %eax, %r15d

    # print the file
    movl $1, %eax # syscall #1 = write.
    movl $1, %edi # first argument: file descriptor. 1 is stdout.
    movq %r13, %rsi # second argument: address of data to write.
    movl %r15d, %edx # third argument: number of bytes to write.
    syscall # result ignored.
    cmpq $MAX_READ_BYTES, %r15
    je read_and_write
    ret

cleanup:
    # close the file
    movl $0x03, %eax # syscall #3 = close.
    movl %r14d, %edi # first arg: file descriptor number.
    syscall # result ignored.

exit:
    # set the exit code
    movl $60, %eax # syscall #60 = exit.
    movq $0, %rdi # exit 0 = success.
    syscall

I have assembled this into an ELF binary called asmcat. To test this program, I've got the file /tmp/random:

$ wc -c /tmp/random
94870 /tmp/random

When I run the following, the results are consistent:

$ ./asmcat /tmp/random | wc -c
94870

Here are two separate runs of the same command:

$ cat /tmp/random | ./asmcat | wc -c
65536

$ cat /tmp/random | ./asmcat | wc -c
94870

Redirecting the output to a file consistently generates files of the same size:

for i in {0..25}; do
    cat /tmp/random | ./asmcat > /tmp/asmcat-output-$i
done
for i in {0..25}; do
    wc -c /tmp/asmcat-output-$i
done

All of the resulting files have the same size, 94870. This leads me to believe that the pipe to wc is what is causing the inconsistent behavior. All my program should be doing is reading stdin, 65535 bytes at a time, and writing to stdout. It's possible that there's a bug in the program, but then, why would it consistently redirect to files of consistent sizes? So my strong feeling is that something about the piping is causing an inconsistent measure of the size of my assembly program's output.

Any feedback is welcome, including the approach taken in the assembly program (which I just wrote for fun/practice).

Peter Cordes · Accepted Answer

TL:DR: If your program does two reads before cat can refill the pipe buffer,
the 2nd read gets only 1 byte. That makes your program decide to exit prematurely.

That's the real bug. The other design choices that make this possible are performance problems, not correctness.

Your program stops after any short-read (one where the return value is less than the requested size), instead of waiting for EOF (read() == 0). This is a simplification that's sometimes safe for regular files, but not safe for anything else, especially not a TTY (terminal input), but also not for pipes or sockets. e.g. try running ./asmcat; it exits after you press return on one line, instead of waiting for control-D EOF.

Linux pipe buffers are by default only 64kiB (pipe(7) man page), 1 byte larger than the weird odd-sized buffer you're using. After cat's write fills the pipe buffer, your 65535-byte read leaves 1 byte remaining. If your program wins the race to read the pipe before cat can write again, it reads only 1 byte.

Unfortunately, running under strace ./asmcat slows down the reads too much to observe a short-read, unless you also slow down cat or whatever other program to rate-limit the write side of your input pipe.

Rate-limited input for testing:

pv(1), the pipe-viewer, is handy for this, with rate-limit -L option, and a buffer-size limit so you can make sure its writes are smaller than 64k. (Doing a larger 64k write very infrequently might not always lead to short reads.) But if we just want short reads always, running interactively reading from a terminal is even easier. strace ./asmcat

$ pv -L8K -B16K /tmp/random | strace ./orig_asmcat | wc -c
execve("./orig_asmcat", ["./orig_asmcat"], 0x7ffcd441f750 /* 55 vars */) = 0
brk(NULL)                               = 0x61c000
brk(0x62bfff)                           = 0x62bfff
read(0, "=head1 NAME

=for comment  Gener"..., 65535) = 819
write(1, "=head1 NAME

=for comment  Gener"..., 819) = 819
close(0)                                = 0
exit(0)                                 = ?
+++ exited with 0 +++   # end of strace output
819                     # wc output
 819 B 0:00:00 [4.43KiB/s] [>              ]  0%        # pv's progress bar

vs. with a bugfixed asmcat, we get the expected sequence of short-reads and equal-sized writes. (See below for my version)

execve("./asmcat", ["./asmcat"], 0x7ffd8c58f600 /* 55 vars */) = 0
read(0, "=head1 NAME

=for comment  Gener"..., 65536) = 819
write(1, "=head1 NAME

=for comment  Gener"..., 819) = 819
read(0, "check if a
named variable exists"..., 65536) = 819
write(1, "check if a
named variable exists"..., 819) = 819

Code review

There are multiple wasted instructions, e.g. a mov that writes a register you never read again, like setting EDI before a call, but then the function call takes R12D as the arg, instead of the standard calling convention.

Reading argc, argv early instead of just leaving them on the stack until they're needed is similarly redundant.

.data is pointless: .set is an assemble-time constant. It doesn't matter what the current section is when you define it. You could also write it as MAX_READ_BYTES = 0xffff, more natural syntax for assemble-time constants.

You could allocate your buffer on the stack instead of with brk (it's only 64K - 1, and x86-64 Linux allows 8MiB stacks by default), in which case loading early could make sense. Or just use the BSS, e.g. lcomm buf, 1<<16

It would be a good idea to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. If you use it to copy files, every read after the first one will start near the end of a page, instead of copying a whole number of 4k pages, so the kernel's copy_to_user (read) and copy_from_user (write) will be touching 17 pages of kernel memory per read/write instead of 16. The pagecache for the file data may not be in contiguous kernel addresses, so each separate 4k page takes some overhead to find, and start a separate memcpy for (rep movsb on modern CPUs with the ERMSB feature). Also for disk I/O, the kernel will have to buffer your writes back into aligned chunks of some multiple of the HW sector size and/or filesystem block size.

64KiB is clearly a good choice when reading from pipes, for the same reason this race was possible. Leaving 1 byte is obviously inefficient. Also, 64k is smaller than L2 cache sizes, so the copy to/from user-space (inside the kernel in your system calls) can re-read from L2 cache when you write again. But smaller sizes mean more system calls, and each system-call has significant overhead (especially with Meltdown and Spectre mitigation in modern kernels.)

64KiB to 128KiB is about a sweet spot for buffer size, given 256KiB L2 caches being typical. (Related: code golf: Fastest yes in the West tunes a program that just makes write system-calls, with x86-64 Linux, with profiling / benchmark results on my Skylake desktop.)

Nothing in the machine code benefits from the size fitting in a uint16_t like 0xFFFF does; either int8_t or int32_t are relevant for immediate operand sizes in 64-bit code. (Or uint32_t if you're zero-extending like mov $imm32, %edx to zero-extend into RDX.)

Don't close stdin; you run close unconditionally. closing stdin doesn't affect the parent process's stdin so it shouldn't be a problem in this program, but the whole point of close seems to be to make this more like a function you could use from a large program. So you should separate your copying fd to stdout from the file-handling.

Use #include to get call numbers instead of hardcoding them. They're guaranteed stable, but it's more human readable / self-documenting to just use the named constants, and avoids any risk of copying errors. (Build with gcc -nostdlib -static asmcat.S -o asmcat; GCC runs .S files through the C preprocessor before assembling, unlike .s)

Style: I like to indent operands to a consistent column so they're not crowding mnemonics. Similarly, comments should be comfortably to the right of operands so you can scan down the column for instructions accessing any given register without getting distracted by comments on shorter instructions.

Comment content: The instruction itself already says what it does, the comment should describe the semantic meaning. (I don't need comments to remind me of calling conventions, like that system calls leave a result in RAX, but even if you do, summarizing the system call with a C version of it can be a good reminder of which arg is which. Like open(argv[1], O_RDONLY).)

I also like to remove redundant operand-size suffixes; the register sizes imply operand-size (just like Intel-syntax). Note that zeroing a 64-bit register only requires xorl; writing a 32-bit register implicitly zero-extends to 64-bit. Your code is sometimes inconsistent about whether things should be 32 or 64-bit. In my rewrite, I used 32-bit everywhere I could. (Except cmp %rax, %rdx return value from write, which seemed like a good idea to make 64-bit, although I don't think there's any real reason.)

My rewrite:

I removed the call/ret stuff, and just let it fall through into cleanup/exit instead of trying to separate it into "functions".

I also changed the buffer size to 64KiB exactly, allocated on the stack with 4k page alignment, and rearranged things to simplify and save instructions everywhere.

Also added a # TODO comment about short writes. That doesn't seem to happen for pipe writes up to 64k; Linux just blocks the write until the buffer has room, but could be a problem writing to a socket maybe? Or maybe only with a larger size, or if a signal like SIGTSTP or SIGSTOP interrupts write()

#include 
BUFSIZE = 1<<16

.section .text
.globl _start
_start:
    pop  %rax      # argc
    pop  %rdi
    pop  %rdi      # argv[1]
     # you'd only ever want to read args this way in _start, which isn't a function

    and  $-4096, %rsp           # round RSP down to a page boundary.
    sub  $BUFSIZE, %rsp         # reserve 64K buffer aligned by 4k

    dec  %eax      # if argc == 1,  then run with input fd = 0   (stdin)
    jz  .Luse_stdin

    # open argv[1]
    mov     $__NR_open, %eax 
    xor     %esi, %esi     # flags: 0 means read-only.
    xor     %edx, %edx     # mode unused without O_CREAT, but zero it out for peace of mind.
    syscall       # fd = open(argv[1], O_RDONLY)

.Luse_stdin:           # don't use stdin as a symbol name; stdio.h / libc also has one of type FILE*
    mov  %eax, %ebx     # save FD
    mov  %rsp, %rsi     # always read and write the same buffer
    jmp  .Lentry        # start with a read then EOF-check as loop condition
              # since we're now error-checking the write,
              # rotating the loop maybe wasn't helpful after all
              # and perhaps just read at the top so we can fall into it would work equally well

read_and_write:              # do {
    # print the file
    mov     %eax, %edx             # size = read_size
    mov     $__NR_write, %eax      # syscall #1 = write.
    mov     $1, %edi               # output fd always stdout
    #mov     %rsp, %rsi             # buf, done once outside loop
    syscall                        # write(1, buf, read_size)

    cmp     %rax, %rdx             # written size should match request
    jne     cleanup                 # TODO: handle short writes by calling again for the unwritten part of the buffer, e.g. add %rax, %rsi
                                    # but also check for write errors.
.Lentry:
     # read the file.
    mov    $__NR_read, %eax     # xor  %eax, %eax
    mov    %ebx, %edi           # input FD
   # mov    %rsp, %rsi           # done once outside loop
    mov    $BUFSIZE, %edx
    syscall                     # size = read(fd, buf, BUFSIZE)

    test   %eax, %eax
    jg     read_and_write    # }while(read_size > 0);   // until EOF or error
# any negative can be assumed to be an error, since we pass a size smaller than INT_MAX

cleanup:
# fd might be stdin which we don't want to close.
# just exit and let kernel take care of it, or check for fd==0
#    movl $__NR_close, %eax
#    movl %ebx, %edi 
#    syscall          # close (fd)  // return value ignored

exit:
    mov  %eax, %edi             # exit status = last syscall return value. read() = 0 means EOF, success.
    mov  $__NR_exit_group, %eax
    syscall                     # exit_group(status);

For instruction counts, perf stat --all-user ./asmcat /tmp/random > /dev/null shows it runs about 47 instructions in user-space, vs. 57 for yours. (IIRC, perf over-counts by 1, so I've subtracted that from the measured result.) And that's with more error-checking, e.g. for short writes.

This is only 84 bytes of machine code in the .text section (vs. 174 bytes for your original), and I didn't optimize for size over speed with stuff like lea 1(%rsi), %eax (after zeroing RSI) instead of mov $1, %eax. (Or with mov %eax, %edi to take advantage of _NR_write == STDIN_FILENO.)

I mostly avoided R8..R15 because they need REX prefixes to access in the machine code.

Tests of error handling:

$ gcc -nostdlib -static asmcat.S -o asmcat            # build
$ cat /tmp/random | strace ./asmcat > /dev/full

execve("./asmcat", ["./asmcat"], 0x7ffde5e369d0 /* 55 vars */) = 0
read(0, "=head1 NAME

=for comment  Gener"..., 65536) = 65536
write(1, "=head1 NAME

=for comment  Gener"..., 65536) = -1 ENOSPC (No space left on device)
exit_group(-28)                         = ?
+++ exited with 228 +++

$ strace ./asmcat <&-      # close stdin
execve("./asmcat", ["./asmcat"], 0x7ffd0f5048c0 /* 55 vars */) = 0
read(0, 0x7ffc1b3ca000, 65536)          = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++

$ strace ./asmcat /noexist
execve("./asmcat", ["./asmcat", "/noexist"], 0x7ffd429f1158 /* 55 vars */) = 0
open("/noexist", O_RDONLY)              = -1 ENOENT (No such file or directory)
read(-2, 0x7ffd4f296000, 65536)         = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++

Hmm, should probably test/jl on the fd after open, if you wanted to do error handling.

Race condition when piping through x86-64 assembly program

Answers (1)

Rate-limited input for testing:

Code review

My rewrite:

Related Questions