Xiyang Liu

Reputation: 131

Why is my SIMD vector add-and-store slower than using std::transform with std::plus<T>? Am I doing my SIMD wrong?

New to SIMD, so please go easy on me if I have made any mistakes.

I am developing on Windows with Visual Studio, MSVC, ISO C++20. My processor is an 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz.

Before using AVX, I checked that my PC supports it by looking at bit 28 of ECX from CPUID leaf 1:

bool supportsAVX()
{
    int cpuInfo[4] = { 0 };
    __cpuid(cpuInfo, 1);
    return (cpuInfo[2] & (1 << 28)) != 0;
}
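
As I understand it, the CPUID bit alone isn't quite the full story: the OS also has to have enabled saving/restoring of the ymm state. A sketch of a stricter check (hypothetical name `supportsAVXWithOS`; the MSVC branch uses `__cpuid`/`_xgetbv`, the GCC/Clang branch uses `<cpuid.h>` and an inline `xgetbv`):

```cpp
#include <cstdint>
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Stricter AVX check: the CPU must support AVX *and* the OS must have
// enabled saving/restoring of the ymm registers (OSXSAVE + XCR0 bits 1-2).
bool supportsAVXWithOS()
{
    std::uint32_t ecx = 0;
#if defined(_MSC_VER)
    int cpuInfo[4] = { 0 };
    __cpuid(cpuInfo, 1);
    ecx = static_cast<std::uint32_t>(cpuInfo[2]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    __get_cpuid(1, &a, &b, &c, &d);
    ecx = c;
#endif
    const bool osxsave = (ecx & (1u << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (ecx & (1u << 28)) != 0;  // CPU supports AVX
    if (!osxsave || !avx)
        return false;

    // XGETBV with ECX = 0: bit 1 = XMM state, bit 2 = YMM state
#if defined(_MSC_VER)
    std::uint64_t xcr0 = _xgetbv(0);
#else
    std::uint32_t lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    std::uint64_t xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
    return (xcr0 & 0x6) == 0x6;
}
```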

Before implementing the SIMD add function, I made sure the vector buffers are 32-byte aligned, using the allocator implementation from this answer: Modern approach to making std::vector allocate aligned memory. The demo code from that answer is here: https://godbolt.org/z/PG5Ph7936

So I defined a 32-byte-aligned vector type for AVX as below.

template<typename T, std::size_t ALIGNMENT_IN_BYTES = 32>
using Aligned32Vector = std::vector<T, AlignedAllocator<T, ALIGNMENT_IN_BYTES> >;
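
For reference, a minimal sketch of the kind of aligned allocator this relies on (the linked answer has the full version; this sketch assumes C++17 aligned `operator new`):

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>
#include <new>
#include <vector>

// Minimal aligned allocator sketch using C++17 aligned new/delete.
template<typename T, std::size_t Alignment = 32>
struct AlignedAllocator
{
    using value_type = T;
    static_assert(Alignment >= alignof(T), "alignment too small for T");

    template<typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        // Aligned operator new guarantees the requested alignment
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t{Alignment});
    }

    template<typename U>
    bool operator==(const AlignedAllocator<U, Alignment>&) const noexcept { return true; }
    template<typename U>
    bool operator!=(const AlignedAllocator<U, Alignment>&) const noexcept { return false; }
};
```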

This is my SIMD add function:

#include "aligned_vector.hpp"
#include <immintrin.h>  // For SSE/AVX intrinsics

void simd_add_sd_float_avx(const simd_util::Aligned32Vector<double>& a, const simd_util::Aligned32Vector<double>& b, simd_util::Aligned32Vector<double>& c)
{
    size_t const total_size = a.size();
    constexpr size_t working_width = 32 / sizeof(double);

    size_t i = 0;

    // AVX SIMD loop; "i + working_width <= total_size" avoids unsigned
    // underflow when total_size < working_width
    for (; i + working_width <= total_size; i += working_width)
    {  // Process 4 doubles at a time
        // Load
        __m256d va = _mm256_load_pd(&a[i]);  // Load 4 double
        __m256d vb = _mm256_load_pd(&b[i]);  // Load 4 double

        // Perform SIMD addition
        __m256d vsum = _mm256_add_pd(va, vb);  // Add 4 double in parallel

        // Store the result back into the 'result' array
        _mm256_store_pd(&c[i], vsum);  // Store 4 double
    }

    // Handle leftovers with masked load/store, so we never read or write
    // past the end of the buffers
    if (i < total_size)
    {
        size_t remaining = total_size - i;
        __m256i mask = _mm256_set_epi64x(
            remaining > 3 ? -1 : 0,
            remaining > 2 ? -1 : 0,
            remaining > 1 ? -1 : 0,
            remaining > 0 ? -1 : 0
        );
        __m256d va = _mm256_maskload_pd(&a[i], mask);  // inactive lanes read as 0
        __m256d vb = _mm256_maskload_pd(&b[i], mask);
        __m256d vsum = _mm256_add_pd(va, vb);
        _mm256_maskstore_pd(&c[i], mask, vsum);  // inactive lanes left untouched
    }
}

This is my simple add function using std::vector:

#include<vector>
#include<algorithm>

void add_vector_float_normal(const std::vector<double>& a, const std::vector<double>& b, std::vector<double>& c) 
{
    std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<double>());
}

My test case is simple: create two vectors of size 1000000 filled with 1, add them, assign the result to a third vector of the same size, and compare their wall times using chrono over 10000 iterations.

Scalar:

std::vector<double> a(1000000,1);
std::vector<double> b(1000000,1);
std::vector<double> c(1000000,0);

auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i)
{
    add_vector_float_normal(a,b,c);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double average_time_us = static_cast<double>(duration.count()) / iterations;

SIMD:

simd_util::Aligned32Vector<double> a(1000000, 1);
simd_util::Aligned32Vector<double> b(1000000, 1);
simd_util::Aligned32Vector<double> c(1000000, 0);

auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i)
{
    simd_add_sd_float_avx(a, b, c);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double average_time_us = static_cast<double>(duration.count()) / iterations;
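
Since my measurement is rough, here is a sketch of a slightly more careful timing harness (hypothetical helper name `time_us_per_iter`): it uses steady_clock, which is monotonic, and the caller should read the result afterwards (e.g. into a volatile) so the optimizer can't drop the work.

```cpp
#include <chrono>
#include <vector>
#include <cstddef>
#include <cassert>

// Run f() `iterations` times and return the average wall time in microseconds.
template<typename F>
double time_us_per_iter(F&& f, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count()
           / iterations;
}
```

Usage would be, e.g., `double avg = time_us_per_iter([&]{ add_vector_float_normal(a, b, c); }, iterations);` followed by `volatile double sink = c[0];` to keep the work observable.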

The scalar run always seems to outperform the SIMD function (timing screenshot omitted).

I know my performance measurement is only a rough estimate, but I would expect SIMD to outperform scalar overall?

Am I doing something wrong? Is it because my remainder handling is inefficient?

EDIT 1

My compiler options from Visual Studio 2022:

/permissive- /ifcOutput "x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\simd_matrix.pch" /diagnostics:column 

In my vcxproj file, it's:

  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
    <ClCompile>
      <WarningLevel>Level3</WarningLevel>
      <FunctionLevelLinking>true</FunctionLevelLinking>
      <IntrinsicFunctions>true</IntrinsicFunctions>
      <SDLCheck>true</SDLCheck>
      <PreprocessorDefinitions>NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
      <ConformanceMode>true</ConformanceMode>
      <LanguageStandard>stdcpp20</LanguageStandard>
      <LanguageStandard_C>stdc17</LanguageStandard_C>
    </ClCompile>
    <Link>
      <SubSystem>Console</SubSystem>
      <EnableCOMDATFolding>true</EnableCOMDATFolding>
      <OptimizeReferences>true</OptimizeReferences>
      <GenerateDebugInformation>true</GenerateDebugInformation>
    </Link>
  </ItemDefinitionGroup>

EDIT 2

I should probably have done this first: as suggested in the comments, I looked at the ASM generated by my compiler. For the simple std::transform test, the compiler does vectorise with the /O2 flag. However, although I know my PC supports AVX, I'm not sure why it's using the smaller SSE (xmm) registers?

And I still need to figure out why my own SIMD implementation, which supposedly uses the wider ymm registers, is slower.

$LL33@add_vector:
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm0, XMMWORD PTR [rcx]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3763
    add rsi, 8
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm1, XMMWORD PTR [rax]
    addpd   xmm1, xmm0
    movups  xmm0, XMMWORD PTR [rcx+16]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3764
    movups  XMMWORD PTR [rdx], xmm1
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm1, XMMWORD PTR [rax+16]
    addpd   xmm1, xmm0
    movups  xmm0, XMMWORD PTR [rcx+32]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3764
    movups  XMMWORD PTR [rdx+16], xmm1
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
    movups  xmm1, XMMWORD PTR [rax+32]
    addpd   xmm1, xmm0
    movups  xmm0, XMMWORD PTR [rcx+48]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
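
One thing I notice in my compiler options above: there is no /arch switch, and as far as I know MSVC on x64 defaults to /arch:SSE2, which would explain the xmm registers. If I wanted the auto-vectorizer to use ymm, I believe the vcxproj setting would be something like the following (equivalent to /arch:AVX2, which my Tiger Lake CPU supports):

```xml
<ClCompile>
  <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>
</ClCompile>
```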

EDIT 3

Again, the comments from @PeterCordes helped a lot. As he suggested, there is loop unrolling in the MSVC-generated code! I will try to implement this in my AVX implementation and see if I can see the difference!

Upvotes: 3

Views: 116

Answers (1)

Xiyang Liu

Reputation: 131

Answering my own question (with help from the comments):

With the /O2 flag, MSVC was able to generate SSE instructions for the addition. Furthermore, the MSVC compiler unrolled the loop. Combining these two compiler optimisations, it was able to outperform my hand-written AVX code by a bit.
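
As a sketch (hypothetical, not my final code) of applying the same unrolling idea to my AVX version: two independent add chains per iteration plus a scalar tail. I use unaligned loads here so the sketch also works on a plain std::vector; the `target("avx")` attribute just lets it compile on GCC/Clang without a global -mavx flag.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>
#include <cassert>

// 2x-unrolled AVX add: 8 doubles per iteration as two independent
// load/add/store chains, which gives the out-of-order core more to overlap.
#if defined(__GNUC__) && !defined(__AVX__)
__attribute__((target("avx")))
#endif
void add_avx_unrolled(const double* a, const double* b, double* c, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256d s0 = _mm256_add_pd(_mm256_loadu_pd(a + i),
                                   _mm256_loadu_pd(b + i));
        __m256d s1 = _mm256_add_pd(_mm256_loadu_pd(a + i + 4),
                                   _mm256_loadu_pd(b + i + 4));
        _mm256_storeu_pd(c + i,     s0);
        _mm256_storeu_pd(c + i + 4, s1);
    }
    // Scalar tail for the last 0-7 elements
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```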

Here I want to give credits to the people who helped me in the comments section, @PeterCordes and @Homer512 - Thank you both.

I will be reading this book for further study: "Modern X86 Assembly Language Programming: Covers x86 64-bit, AVX, AVX2, and AVX-512"

Upvotes: 0
