Florian Bach

Reputation: 989

Get byte representation of ASM instruction within C code

Is there a way, within C code, to go from a textual representation of an ASM instruction (like cmpwi r3, 0x20) to its binary representation (0x2c030020)?

I am writing code that will be embedded into another application at runtime. That code is supposed to alter the behaviour / the code of the running program. That means, there is a bunch of code lines like this:

*((volatile int *)(0x80001234)) = 0x2c030020;

That code writes the ASM instruction cmpwi r3, 0x20 to 0x80001234, overwriting the current instruction at that address. Now, having the constant "0x2c030020" in my C code without knowing what that does is bad for maintaining the code. Thus, I'd usually add comments to code like the one above, stating the ASM instruction: // 2c 03 00 20 = cmpwi r3, 0x20

However, from time to time these get out of sync. I might make a quick change to the integer value and forget to update the comment, or I might just make a typo in the comment, causing confusion.

Is there some way I could do something like this instead? (pseudo-code)

*((volatile int *)(0x80001234)) = asm("cmpwi r3, 0x20");

Would there be a way to make that result in 0x2c030020 being written to 0x80001234? Or would I need a hacky solution with a custom preprocessor running over my C source files, replacing ASM instructions with their byte code?

I know there is the C syntax for inline assembler code using the asm() function, but that would execute the given ASM instructions, not give me their binary representation.

Upvotes: 3

Views: 1871

Answers (2)

Peter Cordes

Reputation: 365537

If you're building the code to run on PowerPC, another way to get those machine code bytes into your object file is with an asm statement at global scope that assembles instructions into the .data or .rodata section.

#include <stdint.h>  // for uint32_t, used in the declaration below

asm(".section .rodata      \n\t"  // or .data if you want to modify it
    ".globl machine_code;  \n\t"
    "machine_code:         \n\t"
    "cmpwi   3,0x20        \n\t"
       ... );

extern uint32_t machine_code[];  // Declaration of the symbol that you define with asm

This is at global scope, and I think GCC always switches to the section it wants before emitting asm for anything (data or code). So a plain .section should be fine here; you shouldn't need .pushsection .rodata before and .popsection after, as you would if you were emitting static data from an asm statement inside a function.

The extern uint32_t machine_code[]; C declaration connects the C array name to the asm symbol name so you can just access the array to copy from it.
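As a rough sketch of that copy step: the helper below writes instruction words to a destination one 32-bit store at a time. The name copy_instructions and the stand-in array machine_code_example are made up for illustration; on a real PowerPC build you would pass machine_code (declared with extern as above) instead of the stand-in.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the array defined by the asm statement, so this sketch
   compiles on any target. On PowerPC you would use
   `extern uint32_t machine_code[];` instead. */
static const uint32_t machine_code_example[] = {
    0x2c030020u, /* cmpwi r3,0x20 */
    0x4e800020u, /* blr */
};

/* Hypothetical helper: copy n instruction words from src to dst,
   using one volatile 32-bit store per word. */
static void copy_instructions(volatile uint32_t *dst,
                              const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

On real hardware, patching code usually also means flushing the data cache and invalidating the instruction cache for the patched range; that part is left out here.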

(AFAIK, PowerPC doesn't have an equivalent of ARM Thumb or RISC-V RV32C, so instruction words are always 32-bit. On RISCs with compressed instructions, you might declare it as an array of uint16_t, or on x86 as an array of uint8_t, and finding instruction boundaries would be a separate problem.)


If you want to be able to execute this machine code from here, put it in .text, which is executable as well as readable. (And declare it as a function prototype instead of an array, or point a function pointer at the array.)


Nick's answer, using CPP constants for array initializers, has the advantage of giving you the machine code as compile-time constants the compiler can see and use as immediates, if it wants. It also results in portable C that can compile for targets other than PowerPC.

Upvotes: 1

Nick ODell

Reputation: 25409

This sounds like a mad thing to do, but I assume you have a good reason for it. Life's no fun without a little bit of madness.

One approach is to run an assembler during your build to generate compile-time constants.

The first step is to make a file that has every assembly instruction you will use, one per line.

For example:

cmpwi   3,0x20
addi    3,3,0
blr

Name that file input.def. Then, use this shell script:

#!/usr/bin/env bash

(cat << HEADER
    .global main
    .text
main:
HEADER
cat input.def) > asm.s

powerpc-linux-gnu-as asm.s -o asm.o

powerpc-linux-gnu-objdump -d asm.o | \
    sed '1,/<main>/ d' | \
    paste -d'\t' - input.def | \
    awk -F'\t' '{
        bytes=$2
        asm=$4
        disasm=$3
        gsub(/ /, "", bytes);
        gsub(/[, ]+/, "_", asm);
        printf("#define ASM_%-20s 0x%s    // disassembly: %s\n", asm, bytes, disasm)
    }' > asm.h

# Clean temporaries
rm asm.s asm.o

(I am using GNU assembler and objdump here. You might need to change this part if you don't use those tools. objdump is being used as a glorified hexdump utility here.)

This shell script:

  1. Creates an assembly file
  2. Assembles it
  3. Disassembles the object file and puts the output side by side with input.def. (This is so it can see what assembly you typed.)
  4. Reformats the hex so it is a legal C constant. Reformats the asm so it is a legal C symbol. Then, writes a define to map the instruction name to the constant.
  5. Puts all of this in asm.h

This is a lot of work, but you can do all of it at compile time.

This produces a header file named asm.h:

#define ASM_cmpwi_3_0x20         0x2c030020    // disassembly: cmpwi   r3,32
#define ASM_addi_3_3_0           0x38630000    // disassembly: addi    r3,r3,0
#define ASM_blr                  0x4e800020    // disassembly: blr
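If you want a quick sanity check on values like these, you can pull the fields back out of the instruction word. This is only a sketch: the helper names are invented here, and it covers just the D-form layout that cmpwi and addi use (6-bit primary opcode on top, a 5-bit RA field, and a 16-bit immediate at the bottom).

```c
#include <stdint.h>

/* Primary opcode: the top 6 bits of a 32-bit PowerPC instruction word. */
static uint32_t ppc_primary_opcode(uint32_t insn)
{
    return insn >> 26;
}

/* RA register field of a D-form instruction (5 bits, starting 16 bits
   above the least significant bit). */
static uint32_t ppc_dform_ra(uint32_t insn)
{
    return (insn >> 16) & 0x1fu;
}

/* Raw 16-bit immediate field of a D-form instruction (not sign-extended). */
static uint32_t ppc_dform_imm(uint32_t insn)
{
    return insn & 0xffffu;
}
```

For 0x2c030020 this gives primary opcode 11 (cmpi, which cmpwi is a mnemonic for), RA 3 and immediate 0x20; for 0x38630000 the primary opcode is 14 (addi).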

You use the asm.h file like this:

#include "asm.h"
*((volatile int *)(0x80001234)) = ASM_cmpwi_3_0x20;

If you need a new asm constant, edit input.def and re-run the shell script.
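If you end up with many patches, one possible way to organise them is a table that pairs a location with one of the generated constants. The sketch below is an assumption about how you might structure that, not part of the script's output: struct patch, patches and apply_patches are hypothetical names, the two defines are copied from the asm.h shown above so the block compiles on its own, and locations are word indexes into a base pointer rather than absolute addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Copied from the generated asm.h shown above; real code would
   #include "asm.h" instead. */
#define ASM_cmpwi_3_0x20 0x2c030020
#define ASM_blr          0x4e800020

/* Hypothetical patch record: which word to overwrite, and with what. */
struct patch {
    size_t   word_index; /* offset from the base pointer, in 32-bit words */
    uint32_t insn;       /* instruction word to store there */
};

static const struct patch patches[] = {
    { 0, ASM_cmpwi_3_0x20 },
    { 2, ASM_blr },
};

/* Apply n patches relative to base, one volatile 32-bit store each. */
static void apply_patches(volatile uint32_t *base,
                          const struct patch *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        base[p[i].word_index] = p[i].insn;
    }
}
```

Because each table entry names the instruction through its ASM_ constant, the mnemonic and the value can't drift apart the way a hand-written comment can.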

Upvotes: 1
