Reputation: 131
New to SIMD, so please go easy on me if I have made any mistakes.
I am developing on Windows with Visual Studio, MSVC, ISO C++20. My processor is an 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz.
Before using AVX, I checked that my PC supports it by looking at bit 28 of ECX:
#include <intrin.h> // __cpuid

bool supportsAVX()
{
    int cpuInfo[4] = { 0 };
    __cpuid(cpuInfo, 1);                   // Leaf 1: processor info and feature bits
    return (cpuInfo[2] & (1 << 28)) != 0;  // ECX bit 28 = AVX
}
Before implementing the SIMD add function, I made sure the vector buffers are 32-byte aligned using the allocator implementation from this answer: Modern approach to making std::vector allocate aligned memory. The demo code from that answer is here: https://godbolt.org/z/PG5Ph7936
So I defined a 32-byte-aligned vector for AVX as below.
template<typename T, std::size_t ALIGNMENT_IN_BYTES = 32>
using Aligned32Vector = std::vector<T, AlignedAllocator<T, ALIGNMENT_IN_BYTES> >;
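For reference, that allocator boils down to roughly the following (a simplified sketch assuming C++17's aligned operator new; the full version is behind the godbolt link above):

#include <cstddef>
#include <new>

// Minimal aligned-allocator sketch; relies on C++17 aligned new/delete.
template<typename T, std::size_t ALIGNMENT_IN_BYTES = 32>
class AlignedAllocator
{
public:
    using value_type = T;
    static constexpr std::align_val_t Alignment{ ALIGNMENT_IN_BYTES };

    AlignedAllocator() noexcept = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, ALIGNMENT_IN_BYTES>&) noexcept {}

    T* allocate(std::size_t n)
    {
        // Aligned operator new guarantees the requested alignment (throws std::bad_alloc on failure)
        return static_cast<T*>(::operator new[](n * sizeof(T), Alignment));
    }

    void deallocate(T* p, std::size_t /*n*/) noexcept
    {
        ::operator delete[](p, Alignment);
    }
};

template<typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template<typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }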
This is my simd add function:
#include "aligned_vector.hpp"
#include <immintrin.h> // For SSE/AVX intrinsics
void simd_add_sd_float_avx(const simd_util::Aligned32Vector<double>& a,
                           const simd_util::Aligned32Vector<double>& b,
                           simd_util::Aligned32Vector<double>& c)
{
    size_t const total_size = a.size();
    constexpr size_t working_width = 32 / sizeof(double); // 4 doubles per 256-bit register
    size_t i = 0;
    // AVX SIMD loop: process full 4-wide chunks
    // (i + working_width <= total_size avoids size_t underflow when total_size < 4)
    for (; i + working_width <= total_size; i += working_width)
    {
        // Load
        __m256d va = _mm256_load_pd(&a[i]); // Load 4 doubles
        __m256d vb = _mm256_load_pd(&b[i]); // Load 4 doubles
        // Perform SIMD addition
        __m256d vsum = _mm256_add_pd(va, vb); // Add 4 doubles in parallel
        // Store the result back into 'c'
        _mm256_store_pd(&c[i], vsum); // Store 4 doubles
    }
    // Handle leftovers (fewer than 4 elements)
    if (i < total_size)
    {
        size_t remaining = total_size - i;
        // Lanes whose sign bit is set select from vr in the blend below
        __m256d mask = _mm256_set_pd(
            remaining > 3 ? -1.0 : 0.0,
            remaining > 2 ? -1.0 : 0.0,
            remaining > 1 ? -1.0 : 0.0,
            remaining > 0 ? -1.0 : 0.0
        );
        // Over-reads up to 3 elements past the end of a and b, but the extra
        // lanes are masked out and never written back
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vr = _mm256_add_pd(va, vb);
        __m256d existing = _mm256_setzero_pd(); // Start from zeros
        __m256d blended = _mm256_blendv_pd(existing, vr, mask);
        // Scalar write-back of only the valid lanes
        alignas(32) double temp[4];
        _mm256_storeu_pd(temp, blended);
        for (size_t j = 0; j < remaining; ++j)
        {
            c[i + j] = temp[j];
        }
    }
}
This is my simple add function using std::vector:
#include <vector>
#include <algorithm>
#include <functional> // std::plus

void add_vector_float_normal(const std::vector<double>& a, const std::vector<double>& b, std::vector<double>& c)
{
    std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<double>());
}
My test case is simple: make two vectors of size 1,000,000 filled with the value 1, add them into a third vector of the same size, and compare the wall time of each version using chrono, averaged over 10,000 iterations.
Scalar:
std::vector<double> a(1000000,1);
std::vector<double> b(1000000,1);
std::vector<double> c(1000000,0);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i)
{
add_vector_float_normal(a,b,c);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double average_time_us = static_cast<double>(duration.count()) / iterations;
SIMD:
simd_util::Aligned32Vector<double> a(1000000, 1);
simd_util::Aligned32Vector<double> b(1000000, 1);
simd_util::Aligned32Vector<double> c(1000000, 0);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i)
{
simd_add_sd_float_avx(a, b, c);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double average_time_us = static_cast<double>(duration.count()) / iterations;
The scalar run always seems to outperform the SIMD function:
I know my performance measurement is very much an estimate, but I would still expect SIMD to outperform scalar overall?
Am I doing something wrong? Is it because my remainder handling is not efficient?
EDIT 1
My compiler options from Visual Studio 2022:
/permissive- /ifcOutput "x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\simd_matrix.pch" /diagnostics:column
In my vcxproj file, it's:
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<SDLCheck>true</SDLCheck>
<PreprocessorDefinitions>NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<ConformanceMode>true</ConformanceMode>
<LanguageStandard>stdcpp20</LanguageStandard>
<LanguageStandard_C>stdc17</LanguageStandard_C>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
EDIT 2
I should probably have done this first: as suggested in the comments, I looked at the assembly generated by my compiler. For the simple std::transform test there is vectorisation/SIMD generated by the compiler with the /O2 flag. Although I know my PC supports AVX, I am not sure why it is using the smaller SSE (xmm) registers?
I also still need to figure out why my own SIMD implementation, which supposedly uses the bigger ymm registers, is slower.
$LL33@add_vector:
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
movups xmm0, XMMWORD PTR [rcx]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3763
add rsi, 8
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
movups xmm1, XMMWORD PTR [rax]
addpd xmm1, xmm0
movups xmm0, XMMWORD PTR [rcx+16]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3764
movups XMMWORD PTR [rdx], xmm1
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
movups xmm1, XMMWORD PTR [rax+16]
addpd xmm1, xmm0
movups xmm0, XMMWORD PTR [rcx+32]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
; Line 3764
movups XMMWORD PTR [rdx+16], xmm1
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\xutility
; Line 501
movups xmm1, XMMWORD PTR [rax+32]
addpd xmm1, xmm0
movups xmm0, XMMWORD PTR [rcx+48]
; File C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\include\algorithm
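One thing I note is that my options above do not include /arch:AVX or /arch:AVX2, which (as far as I understand) is what lets MSVC emit ymm instructions from the auto-vectoriser. In the vcxproj that setting would look something like the following, though I have not yet tested whether it changes the generated code:

<ClCompile>
  <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>
</ClCompile>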
EDIT 3
Again, the comments from @PeterCordes helped a lot: as he suggested, there is loop unrolling in the MSVC-generated code! I will try to implement this in my AVX implementation and see if it makes a difference!
Upvotes: 3
Views: 116
Reputation: 131
Answering my own question (with help from the comments).
With the /O2 flag, MSVC was able to generate SSE instructions for the addition. Furthermore, the compiler unrolled the loop. Combining those two optimisations, it was able to outperform my hand-written AVX code by a bit.
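For reference, this is roughly the kind of manual unrolling I am now trying in my own AVX version (just a sketch with an illustrative function name, assuming the same Aligned32Vector type; the tail is handled with a plain scalar loop, and I have not benchmarked it carefully yet):

#include "aligned_vector.hpp"
#include <immintrin.h>

void simd_add_avx_unrolled(const simd_util::Aligned32Vector<double>& a,
                           const simd_util::Aligned32Vector<double>& b,
                           simd_util::Aligned32Vector<double>& c)
{
    const size_t n = a.size();
    size_t i = 0;
    // Process 16 doubles (4 vectors of 4) per iteration to expose more
    // independent adds per loop iteration, similar to what MSVC's unrolling does
    for (; i + 16 <= n; i += 16)
    {
        __m256d s0 = _mm256_add_pd(_mm256_load_pd(&a[i]),      _mm256_load_pd(&b[i]));
        __m256d s1 = _mm256_add_pd(_mm256_load_pd(&a[i + 4]),  _mm256_load_pd(&b[i + 4]));
        __m256d s2 = _mm256_add_pd(_mm256_load_pd(&a[i + 8]),  _mm256_load_pd(&b[i + 8]));
        __m256d s3 = _mm256_add_pd(_mm256_load_pd(&a[i + 12]), _mm256_load_pd(&b[i + 12]));
        _mm256_store_pd(&c[i],      s0);
        _mm256_store_pd(&c[i + 4],  s1);
        _mm256_store_pd(&c[i + 8],  s2);
        _mm256_store_pd(&c[i + 12], s3);
    }
    // Whatever is left (at most 15 elements) is handled with plain scalar code
    for (; i < n; ++i)
    {
        c[i] = a[i] + b[i];
    }
}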
Here I want to give credit to the people who helped me in the comments section, @PeterCordes and @Homer512 - thank you both.
I will be reading this book for further study: "Modern X86 Assembly Language Programming: Covers x86 64-bit, AVX, AVX2, and AVX-512"
Upvotes: 0