Xiyang Liu

Reputation: 131

Why is my SIMD vector add-and-store slower than using std::transform with std::plus<T>? Am I doing my SIMD wrong?

New to SIMD, so please go easy on me if I have made any mistakes.

I am developing on Windows with Visual Studio, MSVC, ISO C++20. My processor is an 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz.

Before using AVX, I checked that my PC supports it by looking at bit 28 of ECX from CPUID leaf 1:

bool supportsAVX()
{
    int cpuInfo[4] = { 0 };
    __cpuid(cpuInfo, 1);
    return (cpuInfo[2] & (1 << 28)) != 0;
}
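
As I understand it, the CPUID bit alone isn't quite the full story: the OS also has to have enabled saving/restoring of the ymm state. A sketch of a stricter check (hypothetical name `supportsAVXWithOS`; the MSVC branch uses `__cpuid`/`_xgetbv`, the GCC/Clang branch uses `<cpuid.h>` and an inline `xgetbv`):

```cpp
#include <cstdint>
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Stricter AVX check: the CPU must support AVX *and* the OS must have
// enabled saving/restoring of the ymm registers (OSXSAVE + XCR0 bits 1-2).
bool supportsAVXWithOS()
{
    std::uint32_t ecx = 0;
#if defined(_MSC_VER)
    int cpuInfo[4] = { 0 };
    __cpuid(cpuInfo, 1);
    ecx = static_cast<std::uint32_t>(cpuInfo[2]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    __get_cpuid(1, &a, &b, &c, &d);
    ecx = c;
#endif
    const bool osxsave = (ecx & (1u << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (ecx & (1u << 28)) != 0;  // CPU supports AVX
    if (!osxsave || !avx)
        return false;

    // XGETBV with ECX = 0: bit 1 = XMM state, bit 2 = YMM state
#if defined(_MSC_VER)
    std::uint64_t xcr0 = _xgetbv(0);
#else
    std::uint32_t lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    std::uint64_t xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
    return (xcr0 & 0x6) == 0x6;
}
```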

Before implementing the SIMD add function, I made sure the vector buffers are 32-byte aligned, using the allocator implementation from this answer: Modern approach to making std::vector allocate aligned memory. The demo code from that answer is here: https://godbolt.org/z/PG5Ph7936

So I defined a 32-byte-aligned vector type for AVX as below.

template<typename T, std::size_t ALIGNMENT_IN_BYTES = 32>
using Aligned32Vector = std::vector<T, AlignedAllocator<T, ALIGNMENT_IN_BYTES> >;
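
For reference, a minimal sketch of the kind of aligned allocator this relies on (the linked answer has the full version; this sketch assumes C++17 aligned `operator new`):

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>
#include <new>
#include <vector>

// Minimal aligned allocator sketch using C++17 aligned new/delete.
template<typename T, std::size_t Alignment = 32>
struct AlignedAllocator
{
    using value_type = T;
    static_assert(Alignment >= alignof(T), "alignment too small for T");

    template<typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        // Aligned operator new guarantees the requested alignment
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t{Alignment});
    }

    template<typename U>
    bool operator==(const AlignedAllocator<U, Alignment>&) const noexcept { return true; }
    template<typename U>
    bool operator!=(const AlignedAllocator<U, Alignment>&) const noexcept { return false; }
};
```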

This is my SIMD add function:

#include "aligned_vector.hpp"
#include <immintrin.h>  // For SSE/AVX intrinsics

void simd_add_sd_float_avx(const simd_util::Aligned32Vector<double>& a, const simd_util::Aligned32Vector<double>& b, simd_util::Aligned32Vector<double>& c)
{
    size_t const total_size = a.size();
    constexpr size_t working_width = 32 / sizeof(double);

    size_t i = 0;

    // AVX SIMD loop; "i + working_width <= total_size" avoids unsigned
    // underflow when total_size < working_width
    for (; i + working_width <= total_size; i += working_width)
    {  // Process 4 doubles at a time
        // Load
        __m256d va = _mm256_load_pd(&a[i]);  // Load 4 double
        __m256d vb = _mm256_load_pd(&b[i]);  // Load 4 double

        // Perform SIMD addition
        __m256d vsum = _mm256_add_pd(va, vb);  // Add 4 double in parallel

        // Store the result back into the 'result' array
        _mm256_store_pd(&c[i], vsum);  // Store 4 double
    }

    // Handle leftovers with masked load/store, so we never read or write
    // past the end of the buffers
    if (i < total_size)
    {
        size_t remaining = total_size - i;
        __m256i mask = _mm256_set_epi64x(
            remaining > 3 ? -1 : 0,
            remaining > 2 ? -1 : 0,
            remaining > 1 ? -1 : 0,
            remaining > 0 ? -1 : 0
        );
        __m256d va = _mm256_maskload_pd(&a[i], mask);  // inactive lanes read as 0
        __m256d vb = _mm256_maskload_pd(&b[i], mask);
        __m256d vsum = _mm256_add_pd(va, vb);
        _mm256_maskstore_pd(&c[i], mask, vsum);  // inactive lanes left untouched
    }
}

This is my simple add function using std::vector:

#include<vector>
#include<algorithm>

void add_vector_float_normal(const std::vector<double>& a, const std::vector<double>& b, std::vector<double>& c) 
{
    std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<double>());
}

My test case is simple: create two vectors of size 1000000 filled with 1, add them, assign the result to a third vector of the same size, and compare their wall times using chrono over 10000 iterations.

Scalar:

std::vector<double> a(1000000,1);
std::vector<double> b(1000000,1);
std::vector<double> c(1000000,0);

auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i)
{
    add_vector_float_normal(a,b,c);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double average_time_us = static_cast<double>(duration.count()) / iterations;

SIMD:

simd_util::Aligned32Vector<double> a(1000000, 1);
simd_util::Aligned32Vector<double> b(1000000, 1);
simd_util::Aligned32Vector<double> c(1000000, 0);

auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i)
{
    simd_add_sd_float_avx(a, b, c);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double average_time_us = static_cast<double>(duration.count()) / iterations;
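
Since my measurement is rough, here is a sketch of a slightly more careful timing harness (hypothetical helper name `time_us_per_iter`): it uses steady_clock, which is monotonic, and the caller should read the result afterwards (e.g. into a volatile) so the optimizer can't drop the work.

```cpp
#include <chrono>
#include <vector>
#include <cstddef>
#include <cassert>

// Run f() `iterations` times and return the average wall time in microseconds.
template<typename F>
double time_us_per_iter(F&& f, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count()
           / iterations;
}
```

Usage would be, e.g., `double avg = time_us_per_iter([&]{ add_vector_float_normal(a, b, c); }, iterations);` followed by `volatile double sink = c[0];` to keep the work observable.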

The scalar run always seems to outperform the SIMD function (timing screenshot omitted).

I know my performance measurement is only a rough estimate, but I would expect SIMD to outperform scalar overall?

Am I doing something wrong? Is it because my remainder handling is inefficient?

EDIT 1

My compiler options from Visual Studio 2022:

/permissive- /ifcOutput "x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\simd_matrix.pch" /diagnostics:column 

In my vcxproj file, it's:

  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
    <ClCompile>
      <WarningLevel>Level3</WarningLevel>
      <FunctionLevelLinking>true</FunctionLevelLinking>
      <IntrinsicFunctions>true</IntrinsicFunctions>
      <SDLCheck>true</SDLCheck>
      <PreprocessorDefinitions>NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
      <ConformanceMode>true</ConformanceMode>
      <LanguageStandard>stdcpp20</LanguageStandard>
      <LanguageStandard_C>stdc17</LanguageStandard_C>
    </ClCompile>
    <Link>
      <SubSystem>Console</SubSystem>
      <EnableCOMDATFolding>true</EnableCOMDATFolding>
      <OptimizeReferences>true</OptimizeReferences>
      <GenerateDebugInformation>true</GenerateDebugInformation>
    </Link>
  </ItemDefinitionGroup>

EDIT 2

I should probably have done this first: as suggested in the comments, I looked at the ASM generated by my compiler. For the simple std::transform test, the compiler does vectorise with the /O2 flag. However, although I know my PC supports AVX, I'm not sure why it's using the smaller SSE (xmm) registers?

And I still need to figure out why my own SIMD implementation, which supposedly uses the wider ymm registers, is slower.

$LL33@add_vector:
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm0, XMMWORD PTR [rcx]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3763
    add rsi, 8
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm1, XMMWORD PTR [rax]
    addpd   xmm1, xmm0
    movups  xmm0, XMMWORD PTR [rcx+16]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3764
    movups  XMMWORD PTR [rdx], xmm1
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm1, XMMWORD PTR [rax+16]
    addpd   xmm1, xmm0
    movups  xmm0, XMMWORD PTR [rcx+32]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3764
    movups  XMMWORD PTR [rdx+16], xmm1
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm1, XMMWORD PTR [rax+32]
    addpd   xmm1, xmm0
    movups  xmm0, XMMWORD PTR [rcx+48]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
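
One thing I notice in my compiler options above: there is no /arch switch, and as far as I know MSVC on x64 defaults to /arch:SSE2, which would explain the xmm registers. If I wanted the auto-vectorizer to use ymm, I believe the vcxproj setting would be something like the following (equivalent to /arch:AVX2, which my Tiger Lake CPU supports):

```xml
<ClCompile>
  <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>
</ClCompile>
```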

EDIT 3

Again, the comments from @PeterCordes helped a lot. As he suggested, there is loop unrolling in the MSVC-generated code! I will try to implement this in my AVX implementation and see if I can see the difference!

Upvotes: 3

Views: 116

Answers (1)

Xiyang Liu

Reputation: 131

Answering my own question (with help from the comments):

With the /O2 flag, MSVC was able to generate SSE instructions for the addition. Furthermore, the MSVC compiler unrolled the loop. Combining these two compiler optimisations, it was able to outperform my hand-written AVX code by a bit.
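
As a sketch (hypothetical, not my final code) of applying the same unrolling idea to my AVX version: two independent add chains per iteration plus a scalar tail. I use unaligned loads here so the sketch also works on a plain std::vector; the `target("avx")` attribute just lets it compile on GCC/Clang without a global -mavx flag.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>
#include <cassert>

// 2x-unrolled AVX add: 8 doubles per iteration as two independent
// load/add/store chains, which gives the out-of-order core more to overlap.
#if defined(__GNUC__) && !defined(__AVX__)
__attribute__((target("avx")))
#endif
void add_avx_unrolled(const double* a, const double* b, double* c, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256d s0 = _mm256_add_pd(_mm256_loadu_pd(a + i),
                                   _mm256_loadu_pd(b + i));
        __m256d s1 = _mm256_add_pd(_mm256_loadu_pd(a + i + 4),
                                   _mm256_loadu_pd(b + i + 4));
        _mm256_storeu_pd(c + i,     s0);
        _mm256_storeu_pd(c + i + 4, s1);
    }
    // Scalar tail for the last 0-7 elements
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```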

Here I want to give credits to the people who helped me in the comments section, @PeterCordes and @Homer512 - Thank you both.

I will be reading this book for further study: "Modern X86 Assembly Language Programming: Covers x86 64-bit, AVX, AVX2, and AVX-512"

Upvotes: 0
