Reputation: 1039
I am very new to using inline assembly in C++ code. What I want to do is basically a kind of memcpy for buffers whose size is a multiple of 32.
In C++ the code would be something like this:
void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{
assert((sz%32 == 0));
for(const std::uint8_t* it = in; it != (in+sz);it+=32,out+=32)
{
__m256i tmp = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(it));
_mm256_stream_si256(reinterpret_cast<__m256i*>(out),tmp);
}
}
I have already written a little inline assembly, but each time I knew the sizes of the input and output arrays in advance.
So I tried this:
void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{
assert((sz%32 == 0));
__asm__ volatile(
"mov %1, %%eax \n"
"mov $0, %%ebx \n"
"L1: \n"
"vmovntdqa (%[src],%%ebx), %%ymm0 \n"
"vmovntdq %%ymm0, (%[dst],%%ebx) \n"
"add %%ebx, $32 \n"
"cmp %%eax, %%ebx \n"
"jz L1 \n"
:[dst]"=r"(out)
:[src]"r"(in),"m"(sz)
:"memory"
);
}
G++ told me :
Error: unsupported instruction `mov'
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'
So I tried this :
void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{
assert((sz%32 == 0));
__asm__ volatile(
"mov %1, %%eax \n"
"mov $0, %%ebx \n"
"L1: \n"
"vmovntdqa %%ebx(%[src]), %%ymm0 \n"
"vmovntdq %%ymm0, (%[dst],%%ebx) \n"
"add %%ebx, $32 \n"
"cmp %%eax, %%ebx \n"
"jz L1 \n"
:[dst]"=r"(out)
:[src]"r"(in),"m"(sz)
:"memory"
);
}
G++ reports:
Error: unsupported instruction `mov'
Error: junk `(%rdi)' after register
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'
In neither case could I find a solution. I also tried this version:
void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{
__asm__ volatile (
".intel_syntax noprefix;"
"mov eax, [SZ];"
"mov ebx, 0;"
"L1 : "
"vmovntdqa ymm0, [src+ebx];"
"vmovntdq [dst+ebx], ymm0;"
"add ebx, 32 \n"
"cmp ebx, eax \n"
"jz L1 \n"
".att_syntax;"
: [dst]"=r"(out)
: [SZ]"m"(sz),[src]"r"(in)
: "memory");
}
G++ :
undefined reference to `SZ'
undefined reference to `src'
undefined reference to `dst'
This message looks very common, but I have no idea how to fix it in this case.
I also know that my attempts do not strictly match the code I wrote in C++.
I would like to understand what is wrong with my attempts, and also how to translate my C++ function as closely as possible.
Thanks in advance.
Upvotes: 0
Views: 1996
Reputation: 3675
Your first example is the closest to correct, but it has the following errors:
- dst is declared as an output when it should be an input.
- The operands of the add instruction are the wrong way round: in AT&T syntax the destination register comes last.
And the following performance issue:
- The sz parameter is passed by reference. (This may also impair optimisations in calling functions.)
Here is a fixed version, which is no faster than the equivalent C++ with intrinsics:
void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t sz)
{
std::size_t count = 0;
__m256i temp;
assert((sz%32 == 0));
__asm__ volatile(
"1: \n"
"vmovntdqa (%[src],%[count]), %[temp] \n"
"vmovntdq %[temp], (%[dst],%[count]) \n"
"add $32, %[count] \n"
"cmp %[sz], %[count] \n"
"jne 1b \n" // loop again until count reaches sz
:[count]"+r"(count), [temp]"=x"(temp)
:[dst]"r"(out), [src]"r"(in), [sz]"r"(sz)
:"memory", "cc"
);
}
The source and destination parameters are in the opposite order to memcpy, which is potentially confusing.
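For comparison, here is a minimal sketch of the intrinsics version with the parameters in memcpy order (destination first). The name stream_copy32 is hypothetical, and the sketch assumes GCC or Clang on x86-64, a CPU with AVX2, and 32-byte-aligned pointers. The trailing _mm_sfence is an addition not present in the code above; it orders the non-temporal stores before the function returns.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical name; parameters follow memcpy order (destination first).
// Assumption: both pointers are 32-byte aligned and sz is a multiple of 32.
// The target attribute lets this one function use AVX2 without -mavx2.
__attribute__((target("avx2")))
void stream_copy32(std::uint8_t* dst, const std::uint8_t* src, std::size_t sz)
{
    assert(sz % 32 == 0);
    for (std::size_t i = 0; i != sz; i += 32)
    {
        // Non-temporal 32-byte load followed by a non-temporal 32-byte store.
        const __m256i tmp =
            _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(src + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), tmp);
    }
    _mm_sfence(); // order the streaming stores before returning
}
```

Because the loop condition is checked before the first iteration, this version also handles sz == 0, which the do-while shape of the assembly loop does not.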
Your Intel syntax version also fails to use the correct syntax to refer to the operands (e.g. %[dst]).
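To show the operand syntax in isolation, here is a minimal sketch, assuming GCC or Clang targeting x86-64. The helper name next_block is hypothetical. It declares one named read-write operand and refers to it as %[off] in the template, with the AT&T operand order (source first, destination last).

```cpp
#include <cstddef>

// Hypothetical helper: advance a byte offset by one 32-byte block.
// Assumption: GCC or Clang, x86-64 target, default AT&T syntax.
inline std::size_t next_block(std::size_t offset)
{
    __asm__(
        "add $32, %[off] \n"   // AT&T order: source first, destination last
        : [off] "+r"(offset)   // "+r": read and written, register chosen by the compiler
        :                      // no input-only operands
        : "cc"                 // add modifies the flags
    );
    return offset;
}
```

A bare name such as dst inside the template is passed to the assembler as an ordinary symbol, which is why the linker reported undefined references to SZ, src and dst.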
Upvotes: 2