
Reputation: 241

Does std::vector<Simd_wrapper> have contiguous data in memory?

class Wrapper {
public:
    // some functions operating on the value_
    __m128i value_;
};

int main() {
    std::vector<Wrapper> a;
    a.resize(100);
}

Would the value_ members of the Wrapper objects in the vector a always occupy contiguous memory, without any gaps between the __m128i values?

I mean:

[128 bit for 1st Wrapper][no gap here][128bit for 2nd Wrapper] ...

So far, this seems to be true for g++ on the Intel CPU I am using, and for gcc on Godbolt.

Since the Wrapper object contains only a single __m128i member, does that mean the compiler never needs to add any kind of padding in memory? (Memory layout of vector of POD objects)

Test code 1:

#include <cstddef>   // size_t
#include <cstdint>   // uint32_t
#include <iostream>
#include <vector>
#include <x86intrin.h>

int main()
{
  static constexpr size_t N = 1000;
  std::vector<__m128i> a;
  a.resize(N);
  //__m128i a[1000];
  uint32_t* ptr_a = reinterpret_cast<uint32_t*>(a.data());
  for (size_t i = 0; i < 4*N; ++i)
    ptr_a[i] = i;
  for (size_t i = 1; i < N; ++i){
    a[i-1] = _mm_and_si128 (a[i], a[i-1]);
  }
  for (size_t i = 0; i < 4*N; ++i)
    std::cout << ptr_a[i];
}

Warning:

warning: ignoring attributes on template argument 
'__m128i {aka __vector(2) long long int}'
[-Wignored-attributes]

Assembly (gcc on Godbolt):

.L9:
        add     rax, 16
        movdqa  xmm1, XMMWORD PTR [rax]
        pand    xmm0, xmm1
        movaps  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rdx
        movdqa  xmm0, xmm1
        jne     .L9

I guess this means the data is contiguous, because the loop just adds 16 bytes to the address it reads from on every iteration. It uses pand to do the bitwise AND.

Test code 2:

#include <cstddef>   // size_t
#include <cstdint>   // uint32_t
#include <iostream>
#include <vector>
#include <x86intrin.h>
class Wrapper {
public:
    __m128i value_;
    inline Wrapper& operator &= (const Wrapper& rhs)
    {
        value_ = _mm_and_si128(value_, rhs.value_);
        return *this; // flowing off the end of a non-void function is undefined behavior
    }
}; // Wrapper
int main()
{
  static constexpr size_t N = 1000;
  std::vector<Wrapper> a;
  a.resize(N);
  //__m128i a[1000];
  uint32_t* ptr_a = reinterpret_cast<uint32_t*>(a.data());
  for (size_t i = 0; i < 4*N; ++i) ptr_a[i] = i;
  for (size_t i = 1; i < N; ++i){
    a[i-1] &=a[i];
    //std::cout << ptr_a[i];
  }
  for (size_t i = 0; i < 4*N; ++i)
    std::cout << ptr_a[i];
}

Assembly (gcc on Godbolt):

.L9:
        add     rdx, 2
        add     rax, 32
        movdqa  xmm1, XMMWORD PTR [rax-16]
        pand    xmm0, xmm1
        movaps  XMMWORD PTR [rax-32], xmm0
        movdqa  xmm0, XMMWORD PTR [rax]
        pand    xmm1, xmm0
        movaps  XMMWORD PTR [rax-16], xmm1
        cmp     rdx, 999
        jne     .L9

There appears to be no padding here either. rax increases by 32 in each iteration, which is 2 x 16: the loop is unrolled to process two elements per pass. The extra add rdx, 2 means this loop is not quite as good as the loop from test code 1.
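Rather than inferring the stride from the generated assembly, the element addresses can be compared directly. The following is a minimal sketch, assuming an x86 target with SSE2; stride_in_bytes and is_tightly_packed are hypothetical helper names introduced here, not part of any library. Because std::vector always places consecutive elements sizeof(T) bytes apart, what this really tests is whether sizeof(Wrapper) equals the 16 bytes of the __m128i payload.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128i, _mm_and_si128
#include <vector>

class Wrapper {
public:
    __m128i value_;
    Wrapper& operator&=(const Wrapper& rhs) {
        value_ = _mm_and_si128(value_, rhs.value_);
        return *this;
    }
};

// Hypothetical helper: distance in bytes between elements i and i + 1.
template <typename T>
std::ptrdiff_t stride_in_bytes(const std::vector<T>& v, std::size_t i) {
    assert(i + 1 < v.size());
    const char* first = reinterpret_cast<const char*>(&v[i]);
    const char* second = reinterpret_cast<const char*>(&v[i + 1]);
    return second - first;
}

// Hypothetical helper: true if every element starts exactly payload_bytes
// after the previous one, i.e. there is no padding between the payloads.
template <typename T>
bool is_tightly_packed(const std::vector<T>& v, std::size_t payload_bytes) {
    for (std::size_t i = 0; i + 1 < v.size(); ++i) {
        if (stride_in_bytes(v, i) != static_cast<std::ptrdiff_t>(payload_bytes)) {
            return false;
        }
    }
    return true;
}
```

Calling is_tightly_packed(a, sizeof(__m128i)) on a std::vector<Wrapper> a with at least two elements reports whether the payloads sit back to back on the platform being compiled for. A run-time check like this only describes that one platform; it does not prove anything about other ABIs.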

Test auto-vectorization:

#include <cstddef>   // size_t
#include <cstdint>   // uint32_t
#include <iostream>
#include <vector>
#include <x86intrin.h>

int main()
{
  static constexpr size_t N = 1000;
  std::vector<__m128i> a;
  a.resize(N);
  //__m128i a[1000];
  uint32_t* ptr_a = reinterpret_cast<uint32_t*>(a.data());
  for (size_t i = 0; i < 4*N; ++i)
    ptr_a[i] = i;
  for (size_t i = 1; i < N; ++i){
    a[i-1] = _mm_and_si128 (a[i], a[i-1]);
  }
  for (size_t i = 0; i < 4*N; ++i)
    std::cout << ptr_a[i];
}

Assembly (gcc on Godbolt):

.L21:
        movdqu  xmm0, XMMWORD PTR [r10+rax]
        add     rdi, 1
        pand    xmm0, XMMWORD PTR [r8+rax]
        movaps  XMMWORD PTR [r8+rax], xmm0
        add     rax, 16
        cmp     rsi, rdi
        ja      .L21

I just don't know whether this is always true for Intel CPUs and for g++, the Intel C++ compiler, or any other compiler.

Upvotes: 1

Answers (3)

Peter Cordes

Reputation: 364210

No-padding is safe to assume in practice, unless you're compiling for a non-standard ABI.

All compilers targeting the same ABI must make the same choice about struct/class sizes / layouts, and all the standard ABIs / calling conventions will have no padding in your struct. (i.e. x86-32 and x86-64 System V and Windows, see the x86 tag wiki for links). Your experiments with one compiler confirm it for all compilers targeting the same platform/ABI.

Note that the scope of this question is limited to x86 compilers that support Intel's intrinsics and the __m128i type, which means we have much stronger guarantees than what you get from just the ISO C++ standard without any implementation-specific stuff.

As @zneak points out, you can static_assert(std::is_standard_layout<Wrapper>::value) in the class def to remind people not to add any virtual methods, which would add a vtable pointer to each instance.
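The suggestion above can be written down as compile-time checks next to the class. This is a minimal sketch, assuming an x86 target with SSE2. The two extra assertions on sizeof and alignof are additions beyond what the answer proposes, and and_all is a hypothetical helper included only to show that the class still behaves as a thin wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128i, _mm_and_si128
#include <type_traits>

class Wrapper {
public:
    __m128i value_;
    Wrapper& operator&=(const Wrapper& rhs) {
        value_ = _mm_and_si128(value_, rhs.value_);
        return *this;
    }
};

// Fails to compile if someone adds a virtual function (and with it a vtable pointer).
static_assert(std::is_standard_layout<Wrapper>::value,
              "Wrapper must stay standard-layout");
// Extra checks (an addition, not from the answer): no padding and no change in alignment.
static_assert(sizeof(Wrapper) == sizeof(__m128i),
              "Wrapper must be exactly one __m128i wide");
static_assert(alignof(Wrapper) == alignof(__m128i),
              "Wrapper must keep the alignment of __m128i");

// Hypothetical helper: bitwise AND of n consecutive Wrapper objects.
inline Wrapper and_all(const Wrapper* data, std::size_t n) {
    assert(data != nullptr && n > 0);
    Wrapper acc = data[0];
    for (std::size_t i = 1; i < n; ++i) {
        acc &= data[i];
    }
    return acc;
}
```

If a later change breaks one of these assumptions, the build fails at the class definition instead of producing silently misaligned or padded data.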

Upvotes: 1

Cody

Reputation: 2853

It isn't guaranteed. Galik's answer quotes the standard, so I'll focus on some of the risks of assuming that it will be contiguous.

I wrote this small program and compiled it with gcc, and it did place the integers contiguously:

#include <iostream>
#include <vector>

class A
{
public:
  int a;
  int method() { return 1;}
  float method2() { return 5.5; }
};

int main()
{
  std::vector<A> as;
  for(int i = 0; i < 10; i++)
  {
     as.push_back(A()); 
  }
  for(int i = 0; i < 10; i++)
  {
     std::cout << &as[i] << std::endl; 
  }
}

However, with one small change, gaps started appearing between the integers:

#include <iostream>
#include <vector>

class A
{
public:
  int a;
  int method() { return 1;}
  float method2() { return 5.5; }
  virtual double method3() { return 0.1; } //this is the only change
};

int main()
{
  std::vector<A> as;
  for(int i = 0; i < 10; i++)
  {
     as.push_back(A()); 
  }
  for(int i = 0; i < 10; i++)
  {
     std::cout << &as[i] << std::endl; 
  }
}

Objects with virtual methods (or that inherit from objects with virtual methods) need to store a little extra information, typically a vtable pointer, so that the correct method can be found at runtime; the compiler cannot know in advance whether the base-class version or one of the overrides will be called. This is one reason it is advised never to use memset on a class. As other answers point out, there may also be padding, which isn't guaranteed to be consistent across compilers or even across versions of the same compiler.

In the end, it is probably not worth assuming that the data will be contiguous on a given compiler. Even if you test it and it works, a simple change such as adding a virtual method later will cause you a massive headache.
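The effect can be measured instead of read off printed addresses. Below is a minimal sketch; Plain, WithVirtual and bytes_between_ints are hypothetical names that mirror the two versions of class A above. The elements of the vector are still adjacent in both cases; what changes is the distance between one int member and the next, because each element grows.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

struct Plain {
    int a;
    int method() { return 1; }
};

struct WithVirtual {
    int a;
    int method() { return 1; }
    virtual double method3() { return 0.1; }  // the only functional change
    virtual ~WithVirtual() = default;
};

static_assert(std::is_standard_layout<Plain>::value,
              "no virtual functions: standard-layout");
static_assert(!std::is_standard_layout<WithVirtual>::value,
              "a virtual function makes the class non-standard-layout");

// Hypothetical helper: bytes from the int in element 0 to the int in element 1.
template <typename T>
std::size_t bytes_between_ints(const std::vector<T>& v) {
    assert(v.size() >= 2);
    const char* first = reinterpret_cast<const char*>(&v[0].a);
    const char* second = reinterpret_cast<const char*>(&v[1].a);
    return static_cast<std::size_t>(second - first);
}
```

For a std::vector<Plain> the helper returns sizeof(int) on mainstream ABIs, while for a std::vector<WithVirtual> it returns sizeof(WithVirtual), which is larger because of the hidden per-object bookkeeping and any padding that follows it.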

Upvotes: 1

Galik

Reputation: 48615

There is no guarantee that there won't be padding at the end of the class Wrapper, only that there won't be padding at its beginning.

According to the C++11 Standard:

9.2 Class members [ class.mem ]

20 A pointer to a standard-layout struct object, suitably converted using a reinterpret_cast, points to its initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [ Note: There might therefore be unnamed padding within a standard-layout struct object, but not at its beginning, as necessary to achieve appropriate alignment. — end note ]

Also under sizeof:

5.3.3 Sizeof [ expr.sizeof ]

2 When applied to a reference or a reference type, the result is the size of the referenced type. When applied to a class, the result is the number of bytes in an object of that class including any padding required for placing objects of that type in an array.
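To see what "padding required for placing objects of that type in an array" means in practice, here is a minimal sketch that avoids intrinsics so it compiles on any target. Lanes, LanesAndFlag, array_stride and tail_padding_of_lanes_and_flag are hypothetical names, and the 16-byte alignment is chosen only to imitate a SIMD type. Adding a single bool after a 16-byte-aligned member forces the class size up to the next multiple of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a 16-byte SIMD value (assumption: imitates __m128i's size and alignment).
struct alignas(16) Lanes {
    std::uint32_t x[4];
};

// One extra byte of payload after the 16-byte-aligned member.
struct LanesAndFlag {
    Lanes lanes;
    bool flag;
};

static_assert(sizeof(Lanes) == 16, "four 32-bit lanes, no padding");
static_assert(sizeof(LanesAndFlag) % alignof(LanesAndFlag) == 0,
              "sizeof is always a multiple of the alignment");

// Hypothetical helper: bytes between consecutive elements of a built-in array.
template <typename T, std::size_t N>
std::size_t array_stride(const T (&arr)[N]) {
    static_assert(N >= 2, "need at least two elements");
    const char* first = reinterpret_cast<const char*>(&arr[0]);
    const char* second = reinterpret_cast<const char*>(&arr[1]);
    const std::size_t stride = static_cast<std::size_t>(second - first);
    assert(stride == sizeof(T));  // arrays never add anything beyond sizeof(T)
    return stride;
}

// Hypothetical helper: bytes of LanesAndFlag not occupied by its two members.
inline std::size_t tail_padding_of_lanes_and_flag() {
    return sizeof(LanesAndFlag) - (sizeof(Lanes) + sizeof(bool));
}
```

The array itself never inserts gaps; any gap between payloads is tail padding that is already counted in sizeof. For a class whose only member is the SIMD value there is nothing after the member to pad, which is why the mainstream x86 ABIs end up with no padding for Wrapper, even though the standard text alone does not promise it.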

Upvotes: 2
