Reputation: 135
I'll preface this by saying that C++ is not my typical area of work; I'm more often in C# and Matlab, and I don't pretend to be able to read x86 assembly. Having recently seen some videos on "modern C++" and the new instructions in the latest processors, I figured I'd poke around a bit and see what I can learn. I do have some existing C++ DLLs that would benefit from speed improvements; those DLLs use many trig and power operations from <cmath>.
So I whipped up a simple benchmark program in VS2013 Express for Desktop. The processor on my machine is an Intel i7-4800MQ (Haswell). The program is simple: it fills some std::vector<double>s with 5 million random entries each, then loops over them doing a math operation that combines the values. I measure the time spent using std::chrono::high_resolution_clock::now() immediately before and after the loop:
[Edit: Including full program code]
#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>
int _tmain(int argc, _TCHAR* argv[])
{
// Set up random number generator
std::tr1::mt19937 eng;
std::tr1::normal_distribution<float> dist;
// Number of calculations to do
uint32_t n_points = 5000000;
// Input vectors
std::vector<double> x1;
std::vector<double> x2;
std::vector<double> x3;
// Output vectors
std::vector<double> y1;
// Initialize
x1.reserve(n_points);
x2.reserve(n_points);
x3.reserve(n_points);
y1.reserve(n_points);
// Fill inputs
for (size_t i = 0; i < n_points; i++)
{
x1.push_back(dist(eng));
x2.push_back(dist(eng));
x3.push_back(dist(eng));
}
// Start timer
auto start_time = std::chrono::high_resolution_clock::now();
// Do math loop
for (size_t i = 0; i < n_points; i++)
{
double result_value;
result_value = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
y1.push_back(result_value);
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Duration: " << duration.count() << " ms";
return 0;
}
I put VS into the Release configuration with the default options (e.g. /O2). I build once with /arch:IA32 and run it a few times, and again with /arch:AVX and run it a few times. Consistently, the AVX build is about 3.6x slower than the IA32 one: in this specific example, 773 ms compared to 216 ms.
As a sanity check I also tried some much simpler operations: combinations of multiplies and adds, raising a value to the 8th power, and so on (a representative loop is sketched below). For those, the AVX build is at least as fast as IA32, if not a bit faster. So why is my code above hit so hard, and where might I look to find out?
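The sanity-check loops were roughly of this shape (illustrative; the exact expressions varied, but they only used multiplies and adds, with no <cmath> calls):

// Illustrative sanity-check loop: pure multiplies and adds, no library calls,
// so the compiler can vectorize it directly with either SSE2 or AVX.
for (size_t i = 0; i < n_points; i++)
{
    double v = x1[i];
    double v2 = v * v;
    double v4 = v2 * v2;
    y1.push_back(v4 * v4 + x2[i] * x3[i]);  // x1[i]^8 plus a multiply-add
}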
Edit 2: At the suggestion of someone on Reddit, I rearranged the code into something more vectorizable, which makes both the SSE2 and AVX builds run faster, but AVX is still much slower than SSE2:
#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>
int _tmain(int argc, _TCHAR* argv[])
{
// Set up random number generator
std::tr1::mt19937 eng;
std::tr1::normal_distribution<double> dist;
// Number of calculations to do
uint32_t n_points = 5000000;
// Input vectors
std::vector<double> x1;
std::vector<double> x2;
std::vector<double> x3;
// Output vectors
std::vector<double> y1;
// Initialize
x1.reserve(n_points);
x2.reserve(n_points);
x3.reserve(n_points);
y1.reserve(n_points);
// Fill inputs
for (size_t i = 0; i < n_points; i++)
{
x1.push_back(dist(eng));
x2.push_back(dist(eng));
x3.push_back(dist(eng));
y1.push_back(0.0);
}
// Start timer
auto start_time = std::chrono::high_resolution_clock::now();
// Do math loop
for (size_t i = 0; i < n_points; i++)
{
y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Duration: " << duration.count() << " ms";
return 0;
}
IA32: 209 ms, SSE: 205 ms, SSE2: 75 ms, AVX: 371 ms
As for the specific version of Visual Studio, this is 2013 Express for Desktop, Update 1 (version 12.0.30110.00 Update 1).
Upvotes: 3
Views: 869
Reputation: 5128
So based on @Lưu Vĩnh Phúc's comment I investigated a bit. You can get this to vectorize very nicely, but not using std::vector or std::valarray. I also had to alias the raw pointers when I used std::unique_ptr; otherwise that too would block vectorization.
#include <chrono>
#include <cstdint>
#include <random>
#include <math.h>
#include <iostream>
#include <string>
#include <valarray>
#include <functional>
#include <memory>

#pragma intrinsic(sin, atan)

int wmain(int argc, wchar_t* argv[])
{
    // Set up random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::normal_distribution<double> dist;

    // Number of calculations to do
    const uint32_t n_points = 5000000;

    // Input vectors
    std::unique_ptr<double[]> x1 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x2 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x3 = std::make_unique<double[]>(n_points);

    // Output vectors
    std::unique_ptr<double[]> y1 = std::make_unique<double[]>(n_points);

    auto random = std::bind(dist, eng);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1[i] = random();
        x2[i] = random();
        x3[i] = random();
        y1[i] = 0.0;
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop; alias the raw pointers so the auto-vectorizer isn't blocked
    double * x_1 = x1.get(), *x_2 = x2.get(), *x_3 = x3.get(), *y_1 = y1.get();
    for (size_t i = 0; i < n_points; ++i)
    {
        y_1[i] = sin(x_1[i]) * x_2[i] * atan(x_3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    std::cin.ignore();
    return 0;
}
On my machine this took 103 ms when compiled with /arch:AVX, 252 ms with /arch:IA32, and 98 ms with nothing set.
Looking at the generated assembly, it seems the vectorized math helper functions are implemented using SSE instructions, so surrounding them with AVX instructions causes SSE/AVX transition penalties and slows things down. Hopefully MS will implement AVX versions in the future.
The relevant asm, which lacks vzeroupper:
$LL3@wmain:
vmovupd xmm0, XMMWORD PTR [esi]
call ___vdecl_sin2
mov eax, DWORD PTR tv1250[esp+10212]
vmulpd xmm0, xmm0, XMMWORD PTR [eax+esi]
mov eax, DWORD PTR tv1249[esp+10212]
vmovaps XMMWORD PTR tv1240[esp+10212], xmm0
vmovupd xmm0, XMMWORD PTR [eax+esi]
call ___vdecl_atan2
dec DWORD PTR tv1260[esp+10212]
lea esi, DWORD PTR [esi+16]
vmulpd xmm0, xmm0, XMMWORD PTR tv1240[esp+10212]
vmovupd XMMWORD PTR [edi+esi-16], xmm0
jne SHORT $LL3@wmain
Versus the SSE2 asm; note the same vector sin and atan calls:
$LL3@wmain:
movupd xmm0, XMMWORD PTR [esi]
call ___vdecl_sin2
mov eax, DWORD PTR tv1250[esp+10164]
movupd xmm1, XMMWORD PTR [eax+esi]
mov eax, DWORD PTR tv1249[esp+10164]
mulpd xmm0, xmm1
movaps XMMWORD PTR tv1241[esp+10164], xmm0
movupd xmm0, XMMWORD PTR [eax+esi]
call ___vdecl_atan2
dec DWORD PTR tv1260[esp+10164]
lea esi, DWORD PTR [esi+16]
movaps xmm1, XMMWORD PTR tv1241[esp+10164]
mulpd xmm1, xmm0
movupd XMMWORD PTR [edi+esi-16], xmm1
jne SHORT $LL3@wmain
Other things of note:
Upvotes: 0
Reputation: 29981
When the CPU switches between AVX and SSE instructions, it needs to save/restore the upper halves of the ymm registers, which can incur a fairly large penalty.
Normally compiling with /arch:AVX will fix this for your own code, as the compiler will use VEX-encoded 128-bit (AVX-128) instructions instead of SSE ones where possible. However, in this case it may be that your standard library's math functions are not implemented using AVX instructions, in which case you'd get a transition penalty for every call. You'd have to post a disassembly to be sure.
You often see VZEROUPPER being executed before such a transition, to signal that the CPU doesn't need to preserve the upper halves of the registers, but the compiler isn't smart enough to know whether a function it calls requires it too.
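For illustration only, here is a minimal sketch of how vzeroupper can be inserted by hand via the _mm256_zeroupper() intrinsic when AVX code has to call into SSE-encoded code. The function legacy_sse_kernel is hypothetical, standing in for something like the library's SSE-only helpers; you can't do this around the compiler-generated ___vdecl_* calls, but the principle is the same:

#include <immintrin.h>
#include <cstddef>

// Stand-in for a function built with legacy SSE encodings (in practice it would
// live in a separate translation unit compiled without /arch:AVX).
void legacy_sse_kernel(double* dst, const double* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0;
}

void avx_then_sse(double* dst, const double* src, std::size_t n)
{
    // 256-bit AVX work dirties the upper halves of the ymm registers...
    for (std::size_t i = 0; i + 4 <= n; i += 4)
    {
        __m256d v = _mm256_loadu_pd(src + i);
        _mm256_storeu_pd(dst + i, _mm256_mul_pd(v, v));
    }

    // ...so clear them before transitioning to SSE-encoded code, avoiding the
    // save/restore penalty described above.
    _mm256_zeroupper();

    legacy_sse_kernel(dst, src, n);
}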
Upvotes: 2