justapony

Reputation: 147

C++ operator[] access to elements of SIMD (e.g. AVX) variable

I'm looking for a way to overload operator[] (within a broader SIMD class) to facilitate reading and writing individual elements within a SIMD word (e.g. __m512i). A couple of constraints: the solution should be well-defined, standard C++ (no undefined behavior), and it should stay portable across compilers.

(This rules out things like type punning through pointer casting, and GCC vector types.)

Based heavily on Scott Meyers' "More Effective C++" (Item 30) and other code, I've come up with the following minimal, complete example that seems "right" and seems to work, but also seems overcomplicated. (The "proxy" approach is meant to deal with left-hand/right-hand operator[] usage, and the memcpy is meant to deal with the type-punning/C++ standard issue.)

I'm wondering if someone has a better solution (and can explain it so I learn something ;^))

#include <iostream>
#include <cstring>
#include "immintrin.h"

using T = __m256i;           // SIMD type
using Te = unsigned int;     // SIMD element type

class SIMD {

    class SIMDProxy;

  public :

    const SIMDProxy operator[](int index) const {
      std::cout << "SIMD::operator[] const" << std::endl;
      return SIMDProxy(const_cast<SIMD&>(*this), index);
    }
    SIMDProxy operator[](int index){
      std::cout << "SIMD::operator[]" << std::endl;
      return SIMDProxy(*this, index);
    }
    Te get(int index) {
      std::cout << "SIMD::get" << std::endl;
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));  // or: _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), value);
      return tmp[index];
    }
    void set(int index, Te x) {
      std::cout << "SIMD::set" << std::endl;
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));  // or: _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), value);
      tmp[index] = x;
      std::memcpy(&value, tmp, sizeof(T));  // or: value = _mm256_load_si256(reinterpret_cast<__m256i const *>(tmp));
    }

    void splat(Te x) {
      alignas(T) Te tmp[8];
      for (int i=0; i<8; i++) tmp[i] = x;  // no need to read the old value first
      std::memcpy(&value, tmp, sizeof(T));
    }
    void print() {
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));
      for (int i=0; i<8; i++) std::cout << tmp[i] << " ";
      std::cout << std::endl;
    }

  protected :

  private :

    T value;

    class SIMDProxy {
      public :
        SIMDProxy(SIMD & c_, int index_) : c(c_), index(index_) {};
        // lvalue access
        SIMDProxy& operator=(const SIMDProxy& rhs) {
          std::cout << "SIMDProxy::=SIMDProxy" << std::endl;
          c.set(index, rhs.c.get(rhs.index));  // write rhs's element into this proxy's slot
          return *this;
        }
        SIMDProxy& operator=(Te x) {
          std::cout << "SIMDProxy::=T" << std::endl;
          c.set(index,x);
          return *this;
        }
        // rvalue access
        operator Te() const {
          std::cout << "SIMDProxy::()" << std::endl;
          return c.get(index);
        }
      private:
        SIMD& c;       // SIMD this proxy refers to
        int index;      // index of element we want
    };
    friend class SIMDProxy;   // give SIMDProxy access into SIMD


};

/** a little main to exercise things **/
int
main(int argc, char *argv[])
{

  SIMD x, y;
  Te a = 3;

  x.splat(1);
  x.print();

  y.splat(2);
  y.print();

  x[0] = a;
  x.print();

  y[1] = a;
  y.print();

  x[1] = y[1]; 
  x.print();
}

Upvotes: 4

Views: 1370

Answers (3)

justapony

Reputation: 147

Of the original approaches (memcpy, intrinsic load/store) and the additional suggestions (user-defined union punning, user-defined vector type), the intrinsic approach seems to have a small advantage. This is based on some quick examples I coded up on Godbolt (https://godbolt.org/z/5zdbKe).

The "best" for writing to an element looks something like this.

__m256i foo2(__m256i x, unsigned int a, int index)
{
    alignas(__m256i) unsigned int tmp[8];
    _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), x);  // spill the vector to aligned scratch
    tmp[index] = a;                                           // modify the one element
    __m256i z = _mm256_load_si256(reinterpret_cast<__m256i const *>(tmp));  // reload the whole vector
    return z;
}
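
Reading an element can use the same store-to-scratch pattern; a sketch along the same lines (this particular function isn't in the Godbolt link, the name foo1 is made up):

unsigned int foo1(__m256i x, int index)
{
    alignas(__m256i) unsigned int tmp[8];
    _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), x);  // spill to aligned scratch
    return tmp[index];                                        // read the one element
}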

Upvotes: 1

chtz

Reputation: 18807

If you only care about g++/clang++/icc compatibility, you can just use the __attribute__ these compilers use internally to define their intrinsic vector types:

typedef int32_t int32x16_t __attribute__((vector_size(16*sizeof(int32_t)))) __attribute__((aligned(16*sizeof(int32_t))));

When it makes sense (and is possible on the given architecture), variables of this type will be kept in vector registers. The compilers also provide a read/writable operator[] for this typedef (which should get optimized if the index is known at compile time).
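
For example, a minimal sketch of the built-in subscript in use (readLane/writeLane are made-up names for illustration):

#include <cstdint>

typedef int32_t int32x16_t __attribute__((vector_size(16*sizeof(int32_t)))) __attribute__((aligned(16*sizeof(int32_t))));

int32_t readLane(const int32x16_t &v, int i)
{
    return v[i];   // element read, no type punning needed
}

void writeLane(int32x16_t &v, int i, int32_t x)
{
    v[i] = x;      // element write through the same built-in operator[]
}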

Upvotes: 0

Soonts

Reputation: 21936

Your code is very inefficient. Normally these SIMD types are not present anywhere in memory; they are hardware registers, they don't have addresses, and you can't pass them to memcpy(). Compilers pretend very hard that they're normal variables, which is why your code compiles and probably works, but it's slow: you're doing round trips from registers to memory and back all the time.

Here’s how I would do that, assuming AVX2 and integer lanes.

#include <immintrin.h>
#include <array>
#include <cassert>
#include <cstddef>

class SimdVector
{
    __m256i val;

    alignas( 64 ) static const std::array<int, 8 + 7> s_blendMaskSource;

public:

    int operator[]( size_t lane ) const
    {
        assert( lane < 8 );
        // Move lane index into lowest lane of vector register
        const __m128i shuff = _mm_cvtsi32_si128( (int)lane );
        // Permute the vector so the lane we need is moved to the lowest lane
        // _mm256_castsi128_si256 says "the upper 128 bits of the result are undefined",
        // and we don't care indeed.
        const __m256i tmp = _mm256_permutevar8x32_epi32( val, _mm256_castsi128_si256( shuff ) );
        // Return the lowest lane of the result
        return _mm_cvtsi128_si32( _mm256_castsi256_si128( tmp ) );
    }

    void setLane( size_t lane, int value )
    {
        assert( lane < 8 );
        // Load the blending mask
        const int* const maskLoadPointer = s_blendMaskSource.data() + 7 - lane;
        const __m256i mask = _mm256_loadu_si256( ( const __m256i* )maskLoadPointer );
        // Broadcast the source value into all lanes.
        // The compiler will do equivalent of _mm_cvtsi32_si128 + _mm256_broadcastd_epi32
        const __m256i broadcasted = _mm256_set1_epi32( value );
        // Use vector blending instruction to set the desired lane
        val = _mm256_blendv_epi8( val, broadcasted, mask );
    }

    template<size_t lane>
    int getLane() const
    {
        static_assert( lane < 8 );
        // That thing is not an instruction;
        // compilers emit different ones based on the index
        return _mm256_extract_epi32( val, (int)lane );
    }

    template<size_t lane>
    void setLane( int value )
    {
        static_assert( lane < 8 );
        val = _mm256_insert_epi32( val, value, (int)lane );
    }
};

// Align by 64 bytes to guarantee it's contained within a cache line
alignas( 64 ) const std::array<int, 8 + 7> SimdVector::s_blendMaskSource
{
    0, 0, 0, 0, 0, 0, 0, -1,  0, 0, 0, 0, 0, 0, 0
};
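
Usage might look like this (a hypothetical driver for the class above, assuming an AVX2 target):

#include <iostream>

int main()
{
    SimdVector v;
    for( size_t i = 0; i < 8; i++ )
        v.setLane( i, (int)i * 10 );      // runtime-index insert: broadcast + blend
    std::cout << v[ 3 ] << "\n";          // runtime-index extract: prints 30
    std::cout << v.getLane<5>() << "\n";  // compile-time extract: prints 50
    v.setLane<0>( -1 );                   // compile-time insert
    return 0;
}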

For ARM it’s different. If the lane index is known at compile time, see the vgetq_lane_s32 and vsetq_lane_s32 intrinsics.
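
A minimal sketch with those intrinsics (the lane argument must be a compile-time constant):

#include <arm_neon.h>
#include <cstdint>

int32_t getLane1( int32x4_t v )
{
    return vgetq_lane_s32( v, 1 );     // extract lane 1
}

int32x4_t setLane1( int32x4_t v, int32_t x )
{
    return vsetq_lane_s32( x, v, 1 );  // insert x into lane 1
}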

For setting lanes at a runtime index on ARM you can use the same broadcast + blend trick. Broadcast is vdupq_n_s32. An approximate equivalent of the vector blend is vbslq_s32; it selects every bit independently, but for this use case that's equally suitable because -1 has all 32 bits set.
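
A sketch of that trick for 4 x int32 lanes (setLaneNeon and the mask table are made-up names, reusing the sliding-window idea of s_blendMaskSource above):

#include <arm_neon.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// -1 sits at index 3; sliding the 4-element load window left by `lane`
// lands it on the target lane
alignas( 16 ) static const int32_t s_neonMaskSource[ 4 + 3 ] = { 0, 0, 0, -1, 0, 0, 0 };

int32x4_t setLaneNeon( int32x4_t v, size_t lane, int32_t value )
{
    assert( lane < 4 );
    // Broadcast the new value into all lanes
    const int32x4_t broadcasted = vdupq_n_s32( value );
    // Load a mask with all bits set in the target lane only
    const uint32x4_t mask = vreinterpretq_u32_s32( vld1q_s32( s_neonMaskSource + 3 - lane ) );
    // Bit select: take `broadcasted` where mask bits are set, `v` elsewhere
    return vbslq_s32( mask, broadcasted, v );
}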

For extracting at a runtime index, either write a switch or store the complete vector into memory; I'm not sure which of the two is more efficient.
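
The switch version might look like this (a sketch; each case hands vgetq_lane_s32 the immediate it requires):

#include <arm_neon.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

int32_t getLaneNeon( int32x4_t v, size_t lane )
{
    assert( lane < 4 );
    switch( lane )
    {
    case 0: return vgetq_lane_s32( v, 0 );
    case 1: return vgetq_lane_s32( v, 1 );
    case 2: return vgetq_lane_s32( v, 2 );
    default: return vgetq_lane_s32( v, 3 );
    }
}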

Upvotes: 4
