Reputation: 3637

How do I Load Multiple Float4 from Memory to Registers using Inline GCN assembly in AMD HIP?

Motivation

I'm doing some micro-benchmarks on AMD GPUs to understand its performance characteristics in order to improve kernel performance. I'm now suspecting that different register allocation and instruction scheduling outcomes may affect the effective memory bandwidth. I noticed that the compiler attempts to interleave memory instructions and compute instructions, it also attempts to conserve registers by loading new values as soon as a previous arithmetic instruction finishes. In some cases, I found there can be a notable performance difference. If I deliberately insert some unoptimizable operations such as writing a dummy value to LDS in order to stop compilers from doing such interleaving, sometimes it improves performance. I suspect the reason is that it changes the number of simultaneous memory requests issued at a moment, causing a reduction of utilized memory bandwidth.

Thus, I decided to use inline assembly when targeting AMD HIP to have better control of the micro-benchmarks.

Attempt 1

The following HIP program attempts to use inline assembly to load 4 floats into float4 tmp11, tmp12, tmp13, tmp14

#include <hip/hip_runtime.h>
#include <cstddef>

__global__ void kernel(
    float* __restrict array,
    float4* out,
    uint32_t idx
)
{
    float* a_ptr = &array[idx];
    float4 tmp11, tmp12, tmp13, tmp14;

#ifdef __HIP_PLATFORM_AMD__
    asm volatile(
        "global_load_dwordx4 %0, %1, off\n\t"
        : "=v" (tmp11)
        : "v" (a_ptr)
    );
    asm volatile(
        "global_load_dwordx4 %0, %1, off, offset:16\n\t"
        : "=v" (tmp12)
        : "v" (a_ptr)
    );
    asm volatile(
        "global_load_dwordx4 %0, %1, off, offset:32\n\t"
        : "=v" (tmp13)
        : "v" (a_ptr)
    );
    asm volatile(
        "global_load_dwordx4 %0, %1, off, offset:48\n\t"
        "s_waitcnt vmcnt(0)"
        : "=v" (tmp14)
        : "v" (a_ptr)
    );
#endif
}

int main(void)
{
}

Compiling the source code with hipcc -S main.cpp -o main.S -O3 and inspecting main.S, I found the generated assembly is incorrect. The values are all loaded into the same registers.

    ;;#ASMSTART
    global_load_dwordx4 v[0:3], v[4:5], off

    ;;#ASMEND
    ;;#ASMSTART
    global_load_dwordx4 v[0:3], v[4:5], off, offset:16

    ;;#ASMEND
    ;;#ASMSTART
    global_load_dwordx4 v[0:3], v[4:5], off, offset:32

    ;;#ASMEND
    ;;#ASMSTART
    global_load_dwordx4 v[0:3], v[4:5], off, offset:48
    s_waitcnt vmcnt(0)
    ;;#ASMEND

Attempt 2

The compiler seems to determine that the loads have no effect and it's free to reuse the same registers for all the assembly instructions, which makes sense. If I do some arithmetic operations with the variables, the values will indeed be loaded different registers, likely due to different register allocations.

Thus, my next attempt is loading multiple values into multiple registers using multiple instructions a single statement of inline assembly. The compiler is now aware that the result should go to different registers.

#include <hip/hip_runtime.h>
#include <cstddef>

__global__ void kernel(
    float* __restrict array,
    float4* out,
    uint32_t idx
)
{
    float* a_ptr = &array[idx];
    float4 tmp11, tmp12, tmp13, tmp14;

#ifdef __HIP_PLATFORM_AMD__
    asm volatile(
        "global_load_dwordx4 %0,  %4, off\n\t"
        "global_load_dwordx4 %1,  %4, off offset:16\n\t"
        "global_load_dwordx4 %2,  %4, off offset:32\n\t"
        "global_load_dwordx4 %3,  %4, off offset:48\n\t"
        "s_waitcnt vmcnt(0)"
        : "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
        : "v" (a_ptr)
    );
#endif
}

int main(void)
{
}

Unfortunately, the generated assembly is still incorrect.

    ;;#ASMSTART
    global_load_dwordx4 v[0:3],  v[0:1], off
    global_load_dwordx4 v[4:7],  v[0:1], off offset:16
    global_load_dwordx4 v[8:11],  v[0:1], off offset:32
    global_load_dwordx4 v[12:15],  v[0:1], off offset:48
    s_waitcnt vmcnt(0)
    ;;#ASMEND

The first load instruction clobbers registers v[0:1] so all the subsequent loads would not work as expected.

Question

How do I load multiple float4 from memory to registers using inline GCN assembly in AMD HIP, using the correct inline assembly syntax?

Upvotes: 3

Answers (3)

SargeATM

Reputation: 2841

Add an additional output variable

Notice how a_regs is registered but %0 is never used since it aliases with %5. This gives you the flexibility to align the float4s to registers however you see fit to experiment for performance.

#include <hip/hip_runtime.h>
#include <cstddef>

__global__ void kernel(
    float* __restrict array,
    float4* out,
    uint32_t idx
)
{
    float* a_ptr = &array[idx];
    float4 a_regs, tmp11, tmp12, tmp13, tmp14;

#ifdef __HIP_PLATFORM_AMD__
    asm volatile(
        "global_load_dwordx4 %1,  %5, off\n\t"
        "global_load_dwordx4 %2,  %5, off offset:16\n\t"
        "global_load_dwordx4 %3,  %5, off offset:32\n\t"
        "global_load_dwordx4 %4,  %5, off offset:48\n\t"
        "s_waitcnt vmcnt(0)"
        : "=v" (a_regs), "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
        : "v" (a_ptr)
    );
#endif
}

int main(void)
{
}

Upvotes: 0

SargeATM

Reputation: 2841

Use "+v" (a_ptr)

This designates the register as an input/output register which forces the compiler to give it its own register.

Upvotes: 0

SargeATM

Reputation: 2841

Load in reverse-order

This works with register clobbering to reduce register pressure

#include <hip/hip_runtime.h>
#include <cstddef>

__global__ void kernel(
    float* __restrict array,
    float4* out,
    uint32_t idx
)
{
    float* a_ptr = &array[idx];
    float4 tmp11, tmp12, tmp13, tmp14;

#ifdef __HIP_PLATFORM_AMD__
    asm volatile(
        "global_load_dwordx4 %3,  %4, off offset:48\n\t"
        "global_load_dwordx4 %2,  %4, off offset:32\n\t"
        "global_load_dwordx4 %1,  %4, off offset:16\n\t"
        "global_load_dwordx4 %0,  %4, off\n\t"
        "s_waitcnt vmcnt(0)"
        : "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
        : "v" (a_ptr)
    );
#endif
}

int main(void)
{
}

Upvotes: 0

How do I Load Multiple Float4 from Memory to Registers using Inline GCN assembly in AMD HIP?

Motivation

Attempt 1

Attempt 2

Question

Answers (3)

Add an additional output variable

Use "+v" (a_ptr)

Load in reverse-order

Related Questions