Reputation: 23
I wrote the following simplified implementation of cat in assembly. It uses Linux syscalls because I am running Linux. Here's the code:
.section .data
.set MAX_READ_BYTES, 0xffff
.section .text
.globl _start
_start:
movq (%rsp), %r10 # save the value of argc somewhere else
movq 16(%rsp), %r9 # save the value of argv[1] somewhere else
movl $12, %eax # syscall 12 is brk. see brk(2)
xorq %rdi, %rdi # call with 0 as first arg to get current end of memory
syscall
movq %rax, %r8 # this is the address of the current end of memory
leaq MAX_READ_BYTES(%rax), %rdi # let this be the new end of memory
movl $12, %eax # syscall 12, brk
syscall
cmp %r8, %rax # compare the two; if the allocation failed, these will be equal
je exit
leaq -MAX_READ_BYTES(%rax), %r13 # store the start of the free area in %r13
movq %r10, %rdi # retrieve the value of argc
cmpq $0x01, %rdi # if there are no cli args, process stdin instead
je stdin
# open the file
movl $0x02, %eax # syscall #2 = open.
movq %r9, %rdi
movl $0, %esi # second argument: flags. 0 means read-only.
xorq %rdx, %rdx # this argument isn't used here, but zero it out for peace of mind.
syscall # returns the file descriptor number in %rax
movl %eax, %edi
movl %edi, %r12d # first argument: file descriptor.
call read_and_write
jmp cleanup
stdin:
movl $0x0000, %edi # first argument: file descriptor.
movl %edi, %r12d # first argument: file descriptor.
call read_and_write
jmp cleanup
read_and_write:
# read the file.
movl $0, %eax # syscall #0 = read.
movl %r12d, %edi
movq %r13 /* pointer to allocated memory */, %rsi # second argument: address of a writeable buffer.
movl $MAX_READ_BYTES, %edx # third argument: number of bytes to read.
syscall # num bytes read in %rax
movl %eax, %r15d
# print the file
movl $1, %eax # syscall #1 = write.
movl $1, %edi # first argument: file descriptor. 1 is stdout.
movq %r13, %rsi # second argument: address of data to write.
movl %r15d, %edx # third argument: number of bytes to write.
syscall # result ignored.
cmpq $MAX_READ_BYTES, %r15
je read_and_write
ret
cleanup:
# close the file
movl $0x03, %eax # syscall #3 = close.
movl %r14d, %edi # first arg: file descriptor number.
syscall # result ignored.
exit:
# set the exit code
movl $60, %eax # syscall #60 = exit.
movq $0, %rdi # exit 0 = success.
syscall
I have assembled this into an ELF binary called asmcat. To test this program, I've got the file /tmp/random:
$ wc -c /tmp/random
94870 /tmp/random
When I run the following, the results are consistent:
$ ./asmcat /tmp/random | wc -c
94870
Here are two separate runs of the same command:
$ cat /tmp/random | ./asmcat | wc -c
65536
$ cat /tmp/random | ./asmcat | wc -c
94870
Redirecting the output to a file consistently generates files of the same size:
for i in {0..25}; do
cat /tmp/random | ./asmcat > /tmp/asmcat-output-$i
done
for i in {0..25}; do
wc -c /tmp/asmcat-output-$i
done
All of the resulting files have the same size, 94870. This leads me to believe that the pipe to wc is what is causing the inconsistent behavior. All my program should be doing is reading stdin, 65535 bytes at a time, and writing to stdout. It's possible that there's a bug in the program, but then why would it consistently redirect to files of the same size? So my strong feeling is that something about the piping is causing an inconsistent measure of the size of my assembly program's output.
Any feedback is welcome, including the approach taken in the assembly program (which I just wrote for fun/practice).
Upvotes: 2
Views: 284
Reputation: 364287
TL;DR: If your program does two reads before cat can refill the pipe buffer, the 2nd read gets only 1 byte. That makes your program decide to exit prematurely.
That's the real bug. The other design choices that make this possible are performance problems, not correctness.
Your program stops after any short read (one where the return value is less than the requested size), instead of waiting for EOF (read() == 0). This is a simplification that's sometimes safe for regular files, but not safe for anything else: especially not a TTY (terminal input), but also not pipes or sockets. For example, try running ./asmcat on its own; it exits after you press return on one line, instead of waiting for control-D EOF.
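In C terms, the loop condition that is safe for any kind of fd looks roughly like the sketch below. The function name copy_until_eof and its signature are made up for illustration. Like the assembly rewrite further down, it treats a short write as an error instead of retrying.

```c
#include <errno.h>
#include <unistd.h>

/* Copy from in_fd to out_fd through a caller-supplied buffer.
 * Returns the total number of bytes copied, or -1 on error.
 * A short read only means "that's all that was available right now";
 * the loop ends when read() returns 0 (EOF). */
long long copy_until_eof(int in_fd, int out_fd, char *buf, size_t bufsize)
{
    long long total = 0;

    for (;;) {
        ssize_t got = read(in_fd, buf, bufsize);
        if (got == 0)
            return total;               /* EOF: the only normal way out */
        if (got < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;                  /* real read error */
        }
        /* Write exactly what was read; anything else counts as failure. */
        if (write(out_fd, buf, (size_t)got) != got)
            return -1;
        total += got;
    }
}
```

With a 5-byte buffer and 8 bytes of input, this does a full read of 5, a short read of 3, and then sees 0 and stops.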
Linux pipe buffers are by default only 64 KiB (see the pipe(7) man page), 1 byte larger than the odd-sized buffer you're using. After cat's write fills the pipe buffer, your 65535-byte read leaves 1 byte remaining. If your program wins the race to read the pipe again before cat can write again, it reads only 1 byte.
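The leftover byte can be reproduced in a single process without any race. This is a minimal sketch, assuming the default Linux pipe capacity of 64 KiB; the helper name two_reads_from_pipe is invented here. It closes the write end before reading so the second read can never block.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Put `fill` bytes into a fresh pipe, then read it back with two requests
 * of `request` bytes each. The sizes of the two reads land in got[0] and
 * got[1]. Returns 0 on success, -1 if the pipe could not take `fill`
 * bytes in one write (e.g. a smaller-than-default pipe). */
int two_reads_from_pipe(size_t fill, size_t request, ssize_t got[2])
{
    static char buf[1 << 16];
    int fds[2];

    if (fill > sizeof buf || request > sizeof buf || pipe(fds) != 0)
        return -1;

    /* Non-blocking write end: a small pipe makes write() fail or come up
     * short instead of hanging this single-process demo. */
    fcntl(fds[1], F_SETFL, O_NONBLOCK);
    memset(buf, 'x', fill);
    ssize_t put = write(fds[1], buf, fill);
    close(fds[1]);                  /* no more data will ever arrive */
    if (put != (ssize_t)fill) {
        close(fds[0]);
        return -1;
    }

    got[0] = read(fds[0], buf, request);   /* everything that fits the request */
    got[1] = read(fds[0], buf, request);   /* whatever was left behind */
    close(fds[0]);
    return 0;
}
```

Calling it with fill = 65536 and request = 65535 should give reads of 65535 and 1, which is exactly the pair of reads that made the original program stop early.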
Unfortunately, running under strace ./asmcat slows down the reads too much to observe a short read, unless you also slow down cat or whatever other program is feeding the write side of your input pipe. pv(1), the pipe viewer, is handy for this, with its rate-limit option -L and a buffer-size limit (-B) so you can make sure its writes are smaller than 64k. (Doing a larger 64k write very infrequently might not always lead to short reads.) But if we just want short reads always, reading interactively from a terminal is even easier: strace ./asmcat
$ pv -L8K -B16K /tmp/random | strace ./orig_asmcat | wc -c
execve("./orig_asmcat", ["./orig_asmcat"], 0x7ffcd441f750 /* 55 vars */) = 0
brk(NULL) = 0x61c000
brk(0x62bfff) = 0x62bfff
read(0, "=head1 NAME\n\n=for comment Gener"..., 65535) = 819
write(1, "=head1 NAME\n\n=for comment Gener"..., 819) = 819
close(0) = 0
exit(0) = ?
+++ exited with 0 +++ # end of strace output
819 # wc output
819 B 0:00:00 [4.43KiB/s] [> ] 0% # pv's progress bar
vs. with a bugfixed asmcat, we get the expected sequence of short reads and equal-sized writes. (See below for my version.)
execve("./asmcat", ["./asmcat"], 0x7ffd8c58f600 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment Gener"..., 65536) = 819
write(1, "=head1 NAME\n\n=for comment Gener"..., 819) = 819
read(0, "check if a\nnamed variable exists"..., 65536) = 819
write(1, "check if a\nnamed variable exists"..., 819) = 819
There are multiple wasted instructions, e.g. a mov that writes a register you never read again, like setting EDI before a call when the function then takes its arg in R12D instead of following the standard calling convention.
Reading argc, argv early instead of just leaving them on the stack until they're needed is similarly redundant.
The .section .data is pointless: .set defines an assemble-time constant, so it doesn't matter what the current section is when you define it. You could also write it as MAX_READ_BYTES = 0xffff, which is more natural syntax for assemble-time constants.
You could allocate your buffer on the stack instead of with brk (it's only 64K - 1, and x86-64 Linux allows 8MiB stacks by default), in which case loading argc and argv early could make sense. Or just use the BSS, e.g. .lcomm buf, 1<<16
It would be a good idea to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. If you use it to copy files, every read after the first one will start near the end of a page, instead of copying a whole number of 4k pages, so the kernel's copy_to_user (read) and copy_from_user (write) will be touching 17 pages of kernel memory per read/write instead of 16. The pagecache for the file data may not be in contiguous kernel addresses, so each separate 4k page takes some overhead to find and to start a separate memcpy for (rep movsb on modern CPUs with the ERMSB feature). Also, for disk I/O, the kernel will have to buffer your writes back into aligned chunks of some multiple of the HW sector size and/or filesystem block size.
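The 16 vs. 17 figure is just arithmetic on file offsets. Here is a rough model, assuming 4 KiB pages and that each read starts at the file offset where the previous one ended; pages_touched is a name invented for this sketch, not a kernel function.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed page size, as in the text above */

/* Number of distinct 4 KiB pages overlapped by the byte range
 * [offset, offset + len). Assumes len > 0. */
size_t pages_touched(size_t offset, size_t len)
{
    size_t first = offset / PAGE_SIZE_BYTES;              /* page holding the first byte */
    size_t last  = (offset + len - 1) / PAGE_SIZE_BYTES;  /* page holding the last byte */
    return last - first + 1;
}
```

With a 65535-byte buffer, the first read (offset 0) overlaps 16 pages, but the second one (offset 65535) overlaps 17, and so does every later one. With a 65536-byte buffer every read overlaps exactly 16.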
64KiB is clearly a good choice when reading from pipes, for the same reason this race was possible. Leaving 1 byte is obviously inefficient. Also, 64k is smaller than L2 cache sizes, so the copy to/from user-space (inside the kernel in your system calls) can re-read from L2 cache when you write again. But smaller sizes mean more system calls, and each system-call has significant overhead (especially with Meltdown and Spectre mitigation in modern kernels.)
64KiB to 128KiB is about a sweet spot for buffer size, given 256KiB L2 caches being typical. (Related: the code golf question "Fastest yes in the West" tunes a program that just makes write system calls, with x86-64 Linux, with profiling / benchmark results on my Skylake desktop.)
Nothing in the machine code benefits from the size fitting in a uint16_t like 0xFFFF does; only int8_t and int32_t are relevant as immediate operand sizes in 64-bit code. (Or uint32_t if you're using something like mov $imm32, %edx to zero-extend into RDX.)
Don't close stdin; you run close unconditionally. Closing stdin doesn't affect the parent process's stdin, so it shouldn't be a problem in this program, but the whole point of close seems to be to make this more like a function you could use from a larger program. So you should separate the code that copies an fd to stdout from the file handling.
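One possible shape for that separation, sketched in C: a copy routine that knows nothing about where the fd came from, and a wrapper that owns open/close and only closes what it opened. Both function names are hypothetical, and the wrapper also checks the result of open, which the original program doesn't do.

```c
#include <fcntl.h>
#include <unistd.h>

/* Copy in_fd to out_fd until EOF. Returns 0 on success, -1 on error.
 * Never opens or closes anything. */
static int copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;
    }
    return n < 0 ? -1 : 0;
}

/* File handling lives here: path == NULL means "use stdin".
 * Only an fd opened by this function gets closed by it. */
int cat_path_or_stdin(const char *path, int out_fd)
{
    int fd = STDIN_FILENO;

    if (path != NULL) {
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;              /* don't try to read from a failed open */
    }

    int rc = copy_fd(fd, out_fd);

    if (fd != STDIN_FILENO)
        close(fd);                  /* stdin belongs to the caller; leave it open */
    return rc;
}
```

The 4096-byte buffer is only there to keep the sketch short; the buffer-size discussion above still applies.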
Use #include <asm/unistd.h> to get call numbers instead of hardcoding them. They're guaranteed stable, but it's more human-readable / self-documenting to just use the named constants, and it avoids any risk of copying errors. (Build with gcc -nostdlib -static asmcat.S -o asmcat; GCC runs .S files through the C preprocessor before assembling, unlike .s files.)
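The same header works from C, which is a quick way to see the named constants in action. A minimal sketch, assuming glibc's syscall() wrapper; raw_write is a made-up name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* make sure glibc declares syscall() */
#endif
#include <asm/unistd.h>     /* __NR_write and the other call numbers */
#include <unistd.h>

/* write(2) invoked by number through the generic syscall() wrapper,
 * using the named constant instead of a hardcoded 1. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(__NR_write, fd, buf, len);
}
```

Unlike the raw syscall instruction, the glibc wrapper returns -1 and sets errno on failure rather than returning a negative errno value.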
Style: I like to indent operands to a consistent column so they're not crowding mnemonics. Similarly, comments should be comfortably to the right of operands so you can scan down the column for instructions accessing any given register without getting distracted by comments on shorter instructions.
Comment content: The instruction itself already says what it does; the comment should describe the semantic meaning. (I don't need comments to remind me of calling conventions, like that system calls leave a result in RAX, but even if you do, summarizing the system call with a C version of it can be a good reminder of which arg is which. Like open(argv[1], O_RDONLY).)
I also like to remove redundant operand-size suffixes; the register sizes imply operand-size (just like Intel-syntax). Note that zeroing a 64-bit register only requires xorl; writing a 32-bit register implicitly zero-extends to 64-bit. Your code is sometimes inconsistent about whether things should be 32 or 64-bit. In my rewrite, I used 32-bit everywhere I could. (Except cmp %rax, %rdx on the return value from write, which seemed like a good idea to make 64-bit, although I don't think there's any real reason.)
I removed the call/ret stuff, and just let it fall through into cleanup/exit instead of trying to separate it into "functions".
I also changed the buffer size to 64KiB exactly, allocated on the stack with 4k page alignment, and rearranged things to simplify and save instructions everywhere.
Also added a # TODO comment about short writes. That doesn't seem to happen for pipe writes up to 64k; Linux just blocks the write until the buffer has room. But it could be a problem when writing to a socket, or maybe only with a larger size, or if a signal like SIGTSTP or SIGSTOP interrupts write().
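For reference, the usual way to handle short writes looks like this in C. It's a sketch of the general technique, not what the assembly below does (that version just bails out on a size mismatch); write_all is an illustrative name.

```c
#include <errno.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, calling write() again for the
 * unwritten tail after a short write.
 * Returns 0 on success, -1 on error (errno is left as write() set it). */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t put = write(fd, buf, len);
        if (put < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing anything: retry */
            return -1;
        }
        buf += put;                 /* advance past what was written */
        len -= (size_t)put;         /* and shrink the remaining count */
    }
    return 0;
}
```

In the assembly loop, the equivalent would be adding the write return value to RSI and subtracting it from RDX before retrying, then resetting RSI to the buffer start before the next read.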
#include <asm/unistd.h>
BUFSIZE = 1<<16
.section .text
.globl _start
_start:
pop %rax # argc
pop %rdi # argv[0], discarded
pop %rdi # argv[1]
# you'd only ever want to read args this way in _start, which isn't a function
and $-4096, %rsp # round RSP down to a page boundary.
sub $BUFSIZE, %rsp # reserve 64K buffer aligned by 4k
dec %eax # if argc == 1, then run with input fd = 0 (stdin)
jz .Luse_stdin
# open argv[1]
mov $__NR_open, %eax
xor %esi, %esi # flags: 0 means read-only.
xor %edx, %edx # mode unused without O_CREAT, but zero it out for peace of mind.
syscall # fd = open(argv[1], O_RDONLY)
.Luse_stdin: # don't use stdin as a symbol name; stdio.h / libc also has one of type FILE*
mov %eax, %ebx # save FD
mov %rsp, %rsi # always read and write the same buffer
jmp .Lentry # start with a read then EOF-check as loop condition
# since we're now error-checking the write,
# rotating the loop maybe wasn't helpful after all
# and perhaps just read at the top so we can fall into it would work equally well
read_and_write: # do {
# print the file
mov %eax, %edx # size = read_size
mov $__NR_write, %eax # syscall #1 = write.
mov $1, %edi # output fd always stdout
#mov %rsp, %rsi # buf, done once outside loop
syscall # write(1, buf, read_size)
cmp %rax, %rdx # written size should match request
jne cleanup # TODO: handle short writes by calling again for the unwritten part of the buffer, e.g. add %rax, %rsi
# but also check for write errors.
.Lentry:
# read the file.
mov $__NR_read, %eax # xor %eax, %eax
mov %ebx, %edi # input FD
# mov %rsp, %rsi # done once outside loop
mov $BUFSIZE, %edx
syscall # size = read(fd, buf, BUFSIZE)
test %eax, %eax
jg read_and_write # }while(read_size > 0); // until EOF or error
# any negative can be assumed to be an error, since we pass a size smaller than INT_MAX
cleanup:
# fd might be stdin which we don't want to close.
# just exit and let kernel take care of it, or check for fd==0
# movl $__NR_close, %eax
# movl %ebx, %edi
# syscall # close (fd) // return value ignored
exit:
mov %eax, %edi # exit status = last syscall return value. read() = 0 means EOF, success.
mov $__NR_exit_group, %eax
syscall # exit_group(status);
For instruction counts, perf stat --all-user ./asmcat /tmp/random > /dev/null shows it runs about 47 instructions in user-space, vs. 57 for yours. (IIRC, perf over-counts by 1, so I've subtracted that from the measured result.) And that's with more error-checking, e.g. for short writes.
This is only 84 bytes of machine code in the .text section (vs. 174 bytes for your original), and I didn't optimize for size over speed with stuff like lea 1(%rsi), %eax (after zeroing RSI) instead of mov $1, %eax. (Or with mov %eax, %edi to take advantage of __NR_write == STDOUT_FILENO.)
I mostly avoided R8..R15 because they need REX prefixes to access in the machine code.
Tests of error handling:
$ gcc -nostdlib -static asmcat.S -o asmcat # build
$ cat /tmp/random | strace ./asmcat > /dev/full
execve("./asmcat", ["./asmcat"], 0x7ffde5e369d0 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment Gener"..., 65536) = 65536
write(1, "=head1 NAME\n\n=for comment Gener"..., 65536) = -1 ENOSPC (No space left on device)
exit_group(-28) = ?
+++ exited with 228 +++
$ strace ./asmcat <&- # close stdin
execve("./asmcat", ["./asmcat"], 0x7ffd0f5048c0 /* 55 vars */) = 0
read(0, 0x7ffc1b3ca000, 65536) = -1 EBADF (Bad file descriptor)
exit_group(-9) = ?
+++ exited with 247 +++
$ strace ./asmcat /noexist
execve("./asmcat", ["./asmcat", "/noexist"], 0x7ffd429f1158 /* 55 vars */) = 0
open("/noexist", O_RDONLY) = -1 ENOENT (No such file or directory)
read(-2, 0x7ffd4f296000, 65536) = -1 EBADF (Bad file descriptor)
exit_group(-9) = ?
+++ exited with 247 +++
Hmm, should probably test/jl on the fd after open, if you wanted to do error handling.
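About the exit codes in those transcripts: the program passes a negative errno value straight to exit_group, and the parent only sees the low 8 bits of it. A tiny C sketch of that arithmetic, with an invented function name:

```c
/* Exit status as the parent process (e.g. the shell) sees it:
 * only the low 8 bits of the value passed to exit_group() survive. */
int visible_exit_status(int status)
{
    return status & 0xff;
}
```

So exit_group(-28) (ENOSPC) shows up as 228 and exit_group(-9) (EBADF) as 247, matching the "exited with" lines in the strace output.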
Upvotes: 7