John_Sharp1318
John_Sharp1318

Reputation: 1039

inline assembly + pointer management

I am very new concerning the usage of inline assembly in C++ codes. What I want to do is basicly a kind of memcopy for pointer with a size modulo 32.

In C++ the code use to be something like this :

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

       assert((sz%32 == 0));

    for(const std::uint8_t* it = beg; it != (beg+sz);it+=32,out+=32)
    {
      __m256i = _mm256_stream_load_si256(reinterpret_cast<__m256i*>(it));
      _mm256_stream_si256(reinterpret_cast<__m256i*>(out),tmp);

    }            
}

I already did a little bit of inline assembly, but each time I knew in advance both the size of the input tab, and the output tab.

So I tried this :

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

     assert((sz%32 == 0));

    __asm__ volatile(

                "mov %1, %%eax \n"
                "mov $0, %%ebx \n"

                "L1: \n"

                "vmovntdqa (%[src],%%ebx), %%ymm0 \n"
                "vmovntdq  %%ymm0, (%[dst],%%ebx) \n"

                "add %%ebx, $32 \n"

                "cmp %%eax, %%ebx \n"
                "jz L1 \n"

                :[dst]"=r"(out)
                :[src]"r"(in),"m"(sz)
                :"memory"
                );

}

G++ told me :

Error: unsupported instruction `mov'
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'

So I tried this :

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

     assert((sz%32 == 0));
__asm__ volatile(

            "mov %1, %%eax \n"
            "mov $0, %%ebx \n"

            "L1: \n"

            "vmovntdqa %%ebx(%[src]), %%ymm0 \n"
            "vmovntdq  %%ymm0, (%[dst],%%ebx) \n"

            "add %%ebx, $32 \n"

            "cmp %%eax, %%ebx \n"
            "jz L1 \n"

            :[dst]"=r"(out)
            :[src]"r"(in),"m"(sz)
            :"memory"
                );

}

I obtain from G++ :

Error: unsupported instruction `mov'
Error: junk `(%rdi)' after register
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'

In every case I tried to find without succes a solution. I experience also this solution :

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

    __asm__ volatile (
          ".intel_syntax noprefix;"

          "mov eax, [SZ];"
          "mov ebx, 0;"

          "L1 : "

          "vmovntdqa ymm0, [src+ebx];"
          "vmovntdq [dst+ebx], ymm0;"

          "add ebx, 32 \n"

          "cmp ebx, eax \n"
          "jz L1 \n"
                ".att_syntax;"
          : [dst]"=r"(out)
          : [SZ]"m"(sz),[src]"r"(in)
          : "memory");



}

G++ :

undefined reference to `SZ'
undefined reference to `src'
undefined reference to `dst'

The message in that look like very common, but I have no idea how to fix it in that case.

I know also my tried do not strictly represent the code I wrote in C++.

I would like to understand what's wrong with my tried, and also how to translate as close as possible my C++ function.

Thank's in advance.

Upvotes: 0

Views: 1996

Answers (1)

Timothy Baldwin
Timothy Baldwin

Reputation: 3675

Your first example is the most correct and has following errors:

  • It uses 32 bit registers instead of 64 bit.
  • 3 registers are changed which are not specified as outputs or clobbers.
  • EAX is loaded with source address, not the size.
  • dst is declared to be an output, when it should be an input.
  • The arguments for the add instruction are the wrong way round, in AT&T syntax the destination register is last.
  • A non-local label is used, which will fail if the asm statement gets duplicated, for example by inlining.

And the following performance issues:

  • The sz parameter is passed by reference. (May also impair optimisations in calling functions)
  • It is then passed into the asm as a memory argument, which requires it is written to memory.
  • Then it is copied to another register.
  • Fixed registers are used instead of letting the compiler choose.

Here is a fixed version, which is no faster than the equivalent C++ with intrinsics:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t sz)
{
     std::size_t count = 0;
     __m256i temp;

     assert((sz%32 == 0));

    __asm__ volatile(

                "1: \n"

                "vmovntdqa (%[src],%[count]), %[temp] \n"
                "vmovntdq  %[temp], (%[dst],%[count]) \n"

                "add $32, %[count] \n"

                "cmp %[sz], %[count] \n"
                "jz 1b \n"

                :[count]"+r"(count), [temp]"=x"(temp)
                :[dst]"r"(out), [src]"r"(in), [sz]"r"(sz)
                :"memory", "cc"
                );

}

The source and destination parameters are the other way round as memcpy which is potentially confusing.

Your Intel syntax version addition also fails to use the correct syntax to refer to arguments (eg %[dst]).

Upvotes: 2

Related Questions