Nick

Reputation: 749

Keyboard buffer in Intel Assembly

I'm having trouble managing "keyboard overflows" in Intel assembly. The main issue is that, after the read call has read the maximum size I specified, the remaining data is thrown at the terminal. I'm using Linux on the x64 architecture. This is, in fact, homework. My main idea is the following:

%define maxChars     10
%define maxChars_2   100

section .bss
   strLida  : resb maxChars
   strLidaL : resd 1

read:
   mov dword [strLidaL], maxChars

   mov rax, 0
   mov rdi, 1
   mov rsi, strLida
   mov rdx, [strLidaL]
   syscall

   mov [strLidaL], rax

size_compare:   
   cmp [strLidaL], maxChars
   jge overflow

overflow:
   mov dword [strLidaL_2], maxChars_2

   mov rax, 0
   mov rdi, 1
   mov rsi, strLida_2
   mov rdx, [strLidaL_2]
   syscall

This is far from a good solution: when the maximum number of characters has been read, it jumps to a second read so that it can swallow the remaining overflowing characters. Is there a syscall for that? Is there a better solution? Thanks for the input.

Upvotes: 1

Views: 507

Answers (1)

Margaret Bloom

Reputation: 44066

Your solution, once generalised, is perfectly fine.

First of all, consider this C program

cook.c

#include <stdio.h>

int main()
{
  char buffer[200];
  scanf("%s", buffer);

  return 0;
}

It's vulnerable (an unbounded %s lets scanf write past the end of buffer) and the return is redundant, but bear with me.
This program just reads a string from the input, pretty much like yours.

If you type a short string like hello world, scanf will read hello into buffer, but world won't appear in the terminal (unlike with your program). So how does scanf do the trick?

A handy way to analyse a program without reverse engineering it (or fetching the source) is strace.
If I run strace ./cook on my system, I can see that cook executes the sys_read system call as

read(0, "hello world\n", 1024)          = 12

Thus scanf simply reads, in this case, in chunks of 1024 bytes.
I don't know the logic used by libc to set the length of the read and since I don't think it's relevant here I won't dig into it.

What if we type more than 1024 characters?
If I type 1 2 3 4 ... 1024 (i.e. all the numbers up to 1024, separated by a space) and press Enter, the result is

manager@debian64-jboss:~$ ./cook
1 2 3 4 5 [... omitted]
manager@debian64-jboss:~$ 284 285 286 287 288 289 290 [... omitted]

showing that part of the input makes it to the terminal prompt.
If we do the math we get 9*2 + 90*3 + 184 * 4 = 1024 as expected.
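
The arithmetic above can be double-checked with a few lines of C. This is only a sketch: last_number_within is a hypothetical helper name, and it assumes every number is followed by exactly one space, as in the typed input.

```c
#include <stdio.h>

/* Hypothetical helper: returns the last number N such that the text
   "1 2 3 ... N " (each number followed by one space) fits entirely
   within `limit` bytes. */
int last_number_within(int limit)
{
    int used = 0;   /* bytes consumed so far */
    int n = 0;      /* last number that fitted completely */

    for (;;) {
        /* snprintf with a NULL buffer and size 0 only reports the
           length it would have written (C99 behaviour). */
        int len = snprintf(NULL, 0, "%d ", n + 1);
        if (used + len > limit)
            break;
        used += len;
        n++;
    }
    return n;
}
```

With a limit of 1024 the helper stops at 283, which matches the transcript: 284 is the first number left over for the shell.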

Long story short: you are not really experiencing a problem - that's the expected behaviour under Linux.
In your case, it is more annoying because you are reading only a few bytes at a time.
The long story involves the input processing mode: canonical or non-canonical.
The default one is canonical where the OS buffers lines of text in order to provide input editing facilities.
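
If you want to see which mode a descriptor is in, the termios interface exposes it as the ICANON flag. The following is a minimal sketch; is_canonical is a hypothetical helper name, not something from libc.

```c
#include <termios.h>

/* Hypothetical helper: returns 1 if fd is a terminal in canonical mode,
   0 if it is a terminal in non-canonical mode, and -1 if fd is not a
   terminal at all (tcgetattr fails, e.g. for a pipe or a regular file). */
int is_canonical(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) != 0)
        return -1;
    return (t.c_lflag & ICANON) ? 1 : 0;
}
```

Called as is_canonical(0) from a program started in an interactive shell, I would expect it to report canonical mode, since that is the default.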

If your program asks for 5 bytes and the user types hello world and presses Enter, the OS will buffer the whole "hello world\n" string, but sys_read will read only up to the space, leaving " world\n" for the next reader (the shell).
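
The same effect can be reproduced without a terminal. The sketch below is an illustration under an assumption: a pipe stands in for the terminal's line buffer, and leftover_after_read is a hypothetical helper name. It queues a line, performs one short read, then counts what is still waiting for the next reader.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical demo: write `line` into a pipe (standing in for the
   terminal's line buffer), read at most `want` bytes from it, then
   count how many bytes were left behind unread.
   Returns the leftover byte count, or -1 on error.
   Assumes `line` is short enough to fit in the pipe buffer. */
long leftover_after_read(const char *line, size_t want)
{
    int fds[2];
    char buf[256];
    long left = 0;
    ssize_t n;

    if (want > sizeof buf || pipe(fds) != 0)
        return -1;
    if (write(fds[1], line, strlen(line)) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);                  /* lets the loop below see end of input */

    n = read(fds[0], buf, want);    /* the program's own short read */
    if (n < 0) {
        close(fds[0]);
        return -1;
    }

    /* Whatever is still queued is what the shell would have received. */
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        left += n;

    close(fds[0]);
    return left;
}
```

For "hello world\n" and a 5-byte read, 7 bytes stay behind, which is exactly " world\n".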

You can choose to fix or mitigate this.

Reading in bigger chunks mitigates the problem, as in the C example. Since you should always check the return value of a function or system call anyway, this shouldn't have a heavy impact on your program layout.
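
As a rough illustration of that mitigation, here is a C sketch that reads one large chunk and keeps only as many characters as the program actually wants. The name read_capped and the CHUNK constant are my own choices for the example; 1024 simply mirrors the size seen in the strace output.

```c
#include <string.h>
#include <unistd.h>

#define CHUNK 1024  /* assumption: same chunk size strace showed for scanf */

/* Hypothetical helper: read one chunk from fd, copy at most max_chars
   bytes of it into dst (no NUL terminator is added) and report how many
   bytes were kept. Returns -1 on error and 0 on end of input.
   Anything beyond max_chars in the chunk is consumed and discarded. */
long read_capped(int fd, char *dst, size_t max_chars)
{
    char chunk[CHUNK];
    ssize_t n = read(fd, chunk, sizeof chunk);

    if (n <= 0)
        return (long)n;     /* error (-1) or end of input (0) */

    size_t keep = (size_t)n < max_chars ? (size_t)n : max_chars;
    memcpy(dst, chunk, keep);
    return (long)keep;
}
```

Input longer than CHUNK bytes would still spill over, so this only mitigates the problem rather than fixing it.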

Alternatively, you can follow the advice of comp.lang.c and read all the input.
In assembly, you can do that in a general way with

 ;edi = file descriptor
emptyfd:
 lea rsi, [rsp-80h]     ;We use the redzone for the read buffer
 mov edx, 80h           ;Chunk length

.read_chunk:
 xor eax, eax       ;sys_read
 syscall

 ;Did we fill the whole buffer? (Note: this also checks for errors as long as rdx != -1)
 cmp rdx, rax
 je .read_chunk

 ret

Beware of the clobbered registers: the syscall instruction itself overwrites rcx and r11, and the routine changes rax, rsi and rdx.
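
For comparison, the same loop can be sketched in C. This is a rendering of the idea rather than a drop-in replacement; drain_fd is a hypothetical name and the 128-byte chunk mirrors the 80h used above.

```c
#include <unistd.h>

/* Hypothetical C counterpart of emptyfd: keep reading 128-byte chunks
   from fd for as long as each read fills the whole chunk.
   Returns the number of bytes discarded, or -1 if a read failed.
   Caveat: if the pending input is an exact multiple of the chunk size,
   the extra read waits for more input (or end of input). */
long drain_fd(int fd)
{
    char chunk[128];
    long total = 0;
    ssize_t n;

    do {
        n = read(fd, chunk, sizeof chunk);
        if (n > 0)
            total += n;
    } while (n == (ssize_t)sizeof chunk);   /* full chunk: more may follow */

    return n < 0 ? -1 : total;
}
```

A typical call would be drain_fd(0) right after the program's own read returned exactly the number of bytes it asked for.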

I'm not aware of any system call that does this, and I wouldn't expect one - the standard input has no special meaning for the kernel.


As a side note, a good way to zero a register is xoring it with itself.
Also, moving or performing an operation on the lower 32-bit part of a 64-bit register zeroes the upper 32 bits - so mov rdi, 1 can be written as mov edi, 1.
NASM will implicitly convert the former into the latter anyway.

Upvotes: 1
