computador7

Reputation: 170

inline asm code organization

I have just written a few small inline asm routines to query the timestamp counter on x86 so that I can profile small portions of code. I would like to put those routines in a header so that I can reuse them in many source files. My question is whether I should write them as macros or as inline functions.

My doubt about inline is that the compiler is not guaranteed to actually inline the function, and since this is a performance-sensitive call I would rather skip the function call overhead. With macros, on the other hand, type safety goes away, and I strictly need a 32-bit int for this. I could state that requirement in a comment, but I try to avoid macros because of their many caveats. Here is the code:

inline void rdtsc(uint64_t* cycles)
{
    uint32_t cycles_high, cycles_low;

    asm volatile (
            ".att_syntax\n"
            "CPUID\n\t"    //Serialize
            "RDTSC\n\t"    //Read clock and cpuid
            "mov %%edx, %0 \n\t"
            "mov %%eax, %1 \n\t" 
             : "=r" (cycles_high), "=r" (cycles_low)
             :: "%edx", "%eax");

    *cycles = ((uint64_t) cycles_high << 32) | cycles_low;
}

Any suggestions on this are welcome. I am just trying to figure out what the preferred style would be for this kind of situation.

Upvotes: 1

Views: 219

Answers (2)

Edward

Reputation: 7080

If you really need to serialize before reading the TSC, you could use the LFENCE instruction instead, which doesn't alter any registers.

If you decide to continue to use CPUID for serialization, you ought to set EAX first (probably to 0, since you're not really concerned about the output) and note that this instruction trashes the EAX, EBX, ECX and EDX registers, so your routine MUST account for this fact.
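As a rough sketch of what that would involve (not the routine I'd recommend; the name rdtsc_cpuid is made up for illustration, and GCC or Clang on x86 is assumed), EAX is loaded with 0 through a matching constraint, and EBX and ECX are declared as clobbers:

```c
#include <stdint.h>

/* Hypothetical CPUID-serialized variant; assumes GCC/Clang on x86. */
static inline uint64_t rdtsc_cpuid(void) {
    uint32_t high, low;
    __asm__ volatile (
        "cpuid\n\t"   /* leaf 0: serializes, overwrites EAX, EBX, ECX, EDX */
        "rdtsc"       /* counter -> EDX:EAX */
        : "=a" (low), "=d" (high)   /* results are read straight from EAX/EDX */
        : "0" (0u)                  /* input: EAX = 0 selects CPUID leaf 0 */
        : "ebx", "ecx");            /* CPUID also overwrites these */
    return ((uint64_t) high << 32) | low;
}
```

On 32-bit position-independent builds, older GCC versions may reject the EBX clobber, because EBX holds the GOT pointer there.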

In all, I'd be inclined to write it like this:

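#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// static inline so the definition can live in a header
static inline uint64_t rdtsc(void) {
    uint32_t high, low;
    asm volatile (
            ".att_syntax\n\t"
            "LFENCE\n\t"            // order RDTSC after earlier instructions
            "RDTSC\n\t"             // counter -> EDX:EAX
            "movl %%eax, %0\n\t"    // low half
            "movl %%edx, %1\n\t"    // high half
             : "=rm" (low), "=rm" (high)
             :: "%edx", "%eax");
    return ((uint64_t) high << 32) | low;
}

int main(void) {
    uint64_t x, y;
    x = rdtsc();
    printf("%" PRIu64 "\n", x);    // PRIu64 is the portable format for uint64_t
    y = rdtsc();
    printf("%" PRIu64 "\n", y);
    printf("%" PRIu64 "\n", y-x);
    return 0;
}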

update:

It's been proposed by @Jester and by @DavidWohlferd that one can eliminate the two extra mov instructions by binding low and high directly to the eax and edx registers with the "a" and "d" constraints.

That version would look like this:

static inline uint64_t rdtsc(void) {
    uint32_t high, low;
    asm volatile (
            ".att_syntax\n\t"
            "LFENCE\n\t"
            "RDTSC\n\t"
             : "=a" (low), "=d" (high));   // outputs bound to EAX and EDX
    return ((uint64_t) high << 32) | low;
}

The resulting code (using gcc 4.8.3 on a 64-bit machine running Linux) using optimization -O2 and including up to the call to printf, is this:

#APP
# 20 "rdtsc.c" 1
    .att_syntax
    LFENCE
    RDTSC

# 0 "" 2
#NO_APP
    movq    %rdx, %rbx
    movl    %eax, %eax
    movl    $.LC0, %edi
    salq    $32, %rbx
    orq %rax, %rbx
    xorl    %eax, %eax
    movq    %rbx, %rsi
    call    printf

The version I originally posted results in this:

#APP
# 7 "rdtsc.c" 1
    .att_syntax
    LFENCE
    RDTSC
    movl %eax, %ecx
    movl %edx, %ebx

# 0 "" 2
#NO_APP
    movl    %ecx, %ecx
    salq    $32, %rbx
    movl    $.LC0, %edi
    orq %rcx, %rbx
    xorl    %eax, %eax
    movq    %rbx, %rsi
    call    printf

The originally posted version is one instruction longer than the direct-register version.
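As for the header question itself, one possible arrangement is sketched below. The file name tsc.h and the include guard are placeholders, and always_inline is a GCC/Clang attribute, so this assumes one of those compilers. static inline keeps each translation unit self-contained, the attribute requests inlining regardless of optimization level, and the function keeps its typed uint64_t return value, which a macro would not.

```c
/* tsc.h (hypothetical file name) */
#ifndef TSC_H
#define TSC_H

#include <stdint.h>

/* Assumes GCC/Clang on x86; always_inline is a compiler extension. */
static inline __attribute__((always_inline)) uint64_t rdtsc(void) {
    uint32_t high, low;
    __asm__ volatile (
        "lfence\n\t"   /* order RDTSC after earlier instructions */
        "rdtsc"        /* counter -> EDX:EAX */
        : "=a" (low), "=d" (high));
    return ((uint64_t) high << 32) | low;
}

#endif /* TSC_H */
```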

Upvotes: 1

Tchou

Reputation: 70

Since you will be measuring the performance of portions of code, not necessarily entire functions, you should not try to inline your performance counter. It doesn't matter whether there is call overhead or not. What matters is that the measurement is consistent, which means the call overhead must either ALWAYS be present or NEVER be present. The former is much easier to achieve than the latter.

Let every portion of your code have the same call overhead.
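One way to sketch this, assuming GCC or Clang on x86: mark the counter noinline so every measurement pays the same call cost, then estimate that fixed cost from back-to-back calls and subtract it. The names rdtsc_call and rdtsc_overhead are invented for this example.

```c
#include <stdint.h>

/* Never inlined, so each measurement includes the same call overhead.
   Assumes GCC/Clang on x86; noinline is a compiler extension. */
__attribute__((noinline)) uint64_t rdtsc_call(void) {
    uint32_t high, low;
    __asm__ volatile (
        "lfence\n\t"   /* order RDTSC after earlier instructions */
        "rdtsc"        /* counter -> EDX:EAX */
        : "=a" (low), "=d" (high));
    return ((uint64_t) high << 32) | low;
}

/* Smallest difference seen between two back-to-back calls,
   used as an estimate of the fixed cost of one measurement. */
uint64_t rdtsc_overhead(int rounds) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < rounds; i++) {
        uint64_t start = rdtsc_call();
        uint64_t stop = rdtsc_call();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

A measured interval would then be stop - start - overhead. Because rdtsc_call is not static, its definition belongs in a single .c file, with only a declaration in the header.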

Upvotes: 1
