Reputation: 3637
I'm doing some micro-benchmarks on AMD GPUs to understand its performance characteristics in order to improve kernel performance. I'm now suspecting that different register allocation and instruction scheduling outcomes may affect the effective memory bandwidth. I noticed that the compiler attempts to interleave memory instructions and compute instructions, it also attempts to conserve registers by loading new values as soon as a previous arithmetic instruction finishes. In some cases, I found there can be a notable performance difference. If I deliberately insert some unoptimizable operations such as writing a dummy value to LDS in order to stop compilers from doing such interleaving, sometimes it improves performance. I suspect the reason is that it changes the number of simultaneous memory requests issued at a moment, causing a reduction of utilized memory bandwidth.
Thus, I decided to use inline assembly when targeting AMD HIP to have better control of the micro-benchmarks.
The following HIP program attempts to use inline assembly to load 4 floats into float4 tmp11, tmp12, tmp13, tmp14
#include <hip/hip_runtime.h>
#include <cstddef>
__global__ void kernel(
float* __restrict array,
float4* out,
uint32_t idx
)
{
float* a_ptr = &array[idx];
float4 tmp11, tmp12, tmp13, tmp14;
#ifdef __HIP_PLATFORM_AMD__
asm volatile(
"global_load_dwordx4 %0, %1, off\n\t"
: "=v" (tmp11)
: "v" (a_ptr)
);
asm volatile(
"global_load_dwordx4 %0, %1, off, offset:16\n\t"
: "=v" (tmp12)
: "v" (a_ptr)
);
asm volatile(
"global_load_dwordx4 %0, %1, off, offset:32\n\t"
: "=v" (tmp13)
: "v" (a_ptr)
);
asm volatile(
"global_load_dwordx4 %0, %1, off, offset:48\n\t"
"s_waitcnt vmcnt(0)"
: "=v" (tmp14)
: "v" (a_ptr)
);
#endif
}
int main(void)
{
}
Compiling the source code with hipcc -S main.cpp -o main.S -O3
and inspecting main.S
, I found the generated assembly is incorrect. The values are all loaded into the same registers.
;;#ASMSTART
global_load_dwordx4 v[0:3], v[4:5], off
;;#ASMEND
;;#ASMSTART
global_load_dwordx4 v[0:3], v[4:5], off, offset:16
;;#ASMEND
;;#ASMSTART
global_load_dwordx4 v[0:3], v[4:5], off, offset:32
;;#ASMEND
;;#ASMSTART
global_load_dwordx4 v[0:3], v[4:5], off, offset:48
s_waitcnt vmcnt(0)
;;#ASMEND
The compiler seems to determine that the loads have no effect and it's free to reuse the same registers for all the assembly instructions, which makes sense. If I do some arithmetic operations with the variables, the values will indeed be loaded different registers, likely due to different register allocations.
Thus, my next attempt is loading multiple values into multiple registers using multiple instructions a single statement of inline assembly. The compiler is now aware that the result should go to different registers.
#include <hip/hip_runtime.h>
#include <cstddef>
__global__ void kernel(
float* __restrict array,
float4* out,
uint32_t idx
)
{
float* a_ptr = &array[idx];
float4 tmp11, tmp12, tmp13, tmp14;
#ifdef __HIP_PLATFORM_AMD__
asm volatile(
"global_load_dwordx4 %0, %4, off\n\t"
"global_load_dwordx4 %1, %4, off offset:16\n\t"
"global_load_dwordx4 %2, %4, off offset:32\n\t"
"global_load_dwordx4 %3, %4, off offset:48\n\t"
"s_waitcnt vmcnt(0)"
: "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
: "v" (a_ptr)
);
#endif
}
int main(void)
{
}
Unfortunately, the generated assembly is still incorrect.
;;#ASMSTART
global_load_dwordx4 v[0:3], v[0:1], off
global_load_dwordx4 v[4:7], v[0:1], off offset:16
global_load_dwordx4 v[8:11], v[0:1], off offset:32
global_load_dwordx4 v[12:15], v[0:1], off offset:48
s_waitcnt vmcnt(0)
;;#ASMEND
The first load instruction clobbers registers v[0:1]
so all the subsequent loads would not work as expected.
How do I load multiple float4 from memory to registers using inline GCN assembly in AMD HIP, using the correct inline assembly syntax?
Upvotes: 3
Views: 385
Reputation: 2841
Notice how a_regs
is registered but %0
is never used since it aliases with %5
. This gives you the flexibility to align the float4s to registers however you see fit to experiment for performance.
#include <hip/hip_runtime.h>
#include <cstddef>
__global__ void kernel(
float* __restrict array,
float4* out,
uint32_t idx
)
{
float* a_ptr = &array[idx];
float4 a_regs, tmp11, tmp12, tmp13, tmp14;
#ifdef __HIP_PLATFORM_AMD__
asm volatile(
"global_load_dwordx4 %1, %5, off\n\t"
"global_load_dwordx4 %2, %5, off offset:16\n\t"
"global_load_dwordx4 %3, %5, off offset:32\n\t"
"global_load_dwordx4 %4, %5, off offset:48\n\t"
"s_waitcnt vmcnt(0)"
: "=v" (a_regs), "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
: "v" (a_ptr)
);
#endif
}
int main(void)
{
}
Upvotes: 0
Reputation: 2841
This designates the register as an input/output register which forces the compiler to give it its own register.
Upvotes: 0
Reputation: 2841
This works with register clobbering to reduce register pressure
#include <hip/hip_runtime.h>
#include <cstddef>
__global__ void kernel(
float* __restrict array,
float4* out,
uint32_t idx
)
{
float* a_ptr = &array[idx];
float4 tmp11, tmp12, tmp13, tmp14;
#ifdef __HIP_PLATFORM_AMD__
asm volatile(
"global_load_dwordx4 %3, %4, off offset:48\n\t"
"global_load_dwordx4 %2, %4, off offset:32\n\t"
"global_load_dwordx4 %1, %4, off offset:16\n\t"
"global_load_dwordx4 %0, %4, off\n\t"
"s_waitcnt vmcnt(0)"
: "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
: "v" (a_ptr)
);
#endif
}
int main(void)
{
}
Upvotes: 0