Tom S

Reputation: 135

AVX 3.6x slower than IA32 in simple benchmark involving <cmath> operations - why so? (VS2013)

I'll preface this by saying C++ is not my typical area of work; I'm more often in C# and Matlab. I also don't pretend to be able to read x86 assembly code. Having recently seen some videos on "modern C++" and the new instructions in the latest processors, though, I figured I'd poke around a bit and see what I could learn. I do have some existing C++ DLLs that would benefit from speed improvements; those DLLs use many trig and power operations from <cmath>.

So I whipped up a simple benchmark program in VS2013 Express for Desktop. The processor on my machine is an Intel i7-4800MQ (Haswell). The program is pretty simple: it fills some std::vector<double>s with 5 million random entries each, then loops over them doing a math operation that combines the values. I measure the elapsed time by calling std::chrono::high_resolution_clock::now() immediately before and after the loop:

[Edit: Including full program code]

#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>  // std::vector
#include <cstdint> // uint32_t

int _tmain(int argc, _TCHAR* argv[])
{

    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<float> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Initialize
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        double result_value; 

        result_value = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);

        y1.push_back(result_value);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}

I put VS into the Release configuration with the standard options (e.g. /O2). I do one build with /arch:IA32 and run it a few times, and another with /arch:AVX and run it a few times. Consistently, the AVX build is ~3.6x slower than the IA32 one: in this specific example, 773 ms compared to 216 ms.
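For reference, the two builds correspond roughly to command lines like these (assuming VS2013's default Release flags; benchmark.cpp is just a stand-in for the project source file, and the IDE adds a few more switches):

cl /EHsc /O2 /arch:IA32 benchmark.cpp
cl /EHsc /O2 /arch:AVX benchmark.cpp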

As a sanity check I did try some other very basic operations: combinations of mults and adds, taking a number to the 8th power, and for those AVX is at least as fast as IA32, if not a bit faster (rough sketch below). So why might my code above be impacted so much? Or where might I look to find out?
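The sanity check looked roughly like this (not the exact code I ran, just the general shape of it):

// Rough sketch: 8th power via repeated multiplies, plus an add to mix operations
for (size_t i = 0; i < n_points; i++)
{
    double v = x1[i];
    double v2 = v * v;       // v^2
    double v4 = v2 * v2;     // v^4
    y1[i] = v4 * v4 + x2[i]; // v^8 + x2
}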

Edit 2: At the suggestion of someone on Reddit, I changed the code around into something more vectorizable, which makes both SSE2 and AVX run faster, but AVX is still much slower than SSE2:

#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>  // std::vector
#include <cstdint> // uint32_t

int _tmain(int argc, _TCHAR* argv[])
{

    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<double> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Initialize
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
        y1.push_back(0.0);
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}

IA32: 209 ms, SSE: 205 ms, SSE2: 75 ms, AVX: 371 ms

As for the specific version of Visual Studio: this is 2013 Express for Desktop, Update 1 (Version 12.0.30110.00 Update 1).

Upvotes: 3

Views: 869

Answers (2)

Mgetz

Reputation: 5128

So based on @Lưu Vĩnh Phúc's comment I investigated a bit. You can get this to vectorize very nicely, but not using std::vector or std::valarray. I also had to alias the raw pointers when using std::unique_ptr; otherwise that too would block vectorization.

#include <chrono>
#include <random>
#include <math.h>
#include <iostream>
#include <string>
#include <valarray>
#include <functional>
#include <memory>
#include <cstdint> // uint32_t

#pragma intrinsic(sin, atan)
int wmain(int argc, wchar_t* argv[])
{

    // Set up random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::normal_distribution<double> dist;

    // Number of calculations to do
    const uint32_t n_points = 5000000;

    // Input vectors
    std::unique_ptr<double[]> x1 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x2 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x3 = std::make_unique<double[]>(n_points);

    // Output vectors
    std::unique_ptr<double[]> y1 = std::make_unique<double[]>(n_points);
    auto random = std::bind(dist, eng);
    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1[i] = random();
        x2[i] = random();
        x3[i] = random();
        y1[i] = 0.0;
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    double * x_1 = x1.get(), *x_2 = x2.get(), *x_3 = x3.get(), *y_1 = y1.get();
    for (size_t i = 0; i < n_points; ++i)
    {
        y_1[i] = sin(x_1[i]) * x_2[i] * atan(x_3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";
    std::cin.ignore();
    return 0;
}

On my machine, compiled with /arch:AVX this took 103 ms, with /arch:IA32 252 ms, and with nothing set 98 ms.

Looking at the generated assembly, it seems the vector math helpers (___vdecl_sin2, ___vdecl_atan2) are implemented using SSE, so mixing AVX-encoded instructions around them causes SSE/AVX transition penalties and slows things down. Hopefully MS will implement AVX versions in the future.

The relevant asm lacking vzeroupper:

$LL3@wmain:
    vmovupd xmm0, XMMWORD PTR [esi]
    call    ___vdecl_sin2
    mov eax, DWORD PTR tv1250[esp+10212]
    vmulpd  xmm0, xmm0, XMMWORD PTR [eax+esi]
    mov eax, DWORD PTR tv1249[esp+10212]
    vmovaps XMMWORD PTR tv1240[esp+10212], xmm0
    vmovupd xmm0, XMMWORD PTR [eax+esi]
    call    ___vdecl_atan2
    dec DWORD PTR tv1260[esp+10212]
    lea esi, DWORD PTR [esi+16]
    vmulpd  xmm0, xmm0, XMMWORD PTR tv1240[esp+10212]
    vmovupd XMMWORD PTR [edi+esi-16], xmm0
    jne SHORT $LL3@wmain

Versus the SSE2 asm; note the same vector sin and atan calls:

$LL3@wmain:
    movupd  xmm0, XMMWORD PTR [esi]
    call    ___vdecl_sin2
    mov eax, DWORD PTR tv1250[esp+10164]
    movupd  xmm1, XMMWORD PTR [eax+esi]
    mov eax, DWORD PTR tv1249[esp+10164]
    mulpd   xmm0, xmm1
    movaps  XMMWORD PTR tv1241[esp+10164], xmm0
    movupd  xmm0, XMMWORD PTR [eax+esi]
    call    ___vdecl_atan2
    dec DWORD PTR tv1260[esp+10164]
    lea esi, DWORD PTR [esi+16]
    movaps  xmm1, XMMWORD PTR tv1241[esp+10164]
    mulpd   xmm1, xmm0
    movupd  XMMWORD PTR [edi+esi-16], xmm1
    jne SHORT $LL3@wmain

Other things of note:

  • VS is only using the bottom 128 bits of the AVX registers, despite them being 256 bits wide (see the sketch after this list)
  • There are no overloads of the vector functions for AVX
  • AVX2 isn't supported yet
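To illustrate the first point: a version that actually used the full ymm width for the multiplies would look something like the sketch below. This is only an illustration, not what the compiler emits; the transcendentals stay scalar here because the ___vdecl_sin2/___vdecl_atan2 helpers have no 256-bit counterpart.

#include <immintrin.h> // AVX intrinsics
#include <cmath>
#include <cstddef>

// Sketch only: process 4 doubles per iteration using full 256-bit ymm registers.
void combine_avx(const double* x1, const double* x2, const double* x3,
                 double* y1, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        // Scalar sin/atan results packed into 256-bit lanes (highest lane first)
        __m256d s = _mm256_set_pd(std::sin(x1[i + 3]), std::sin(x1[i + 2]),
                                  std::sin(x1[i + 1]), std::sin(x1[i]));
        __m256d a = _mm256_set_pd(std::atan(x3[i + 3]), std::atan(x3[i + 2]),
                                  std::atan(x3[i + 1]), std::atan(x3[i]));
        __m256d b = _mm256_loadu_pd(&x2[i]);          // 4 doubles at once
        __m256d r = _mm256_mul_pd(_mm256_mul_pd(s, b), a);
        _mm256_storeu_pd(&y1[i], r);                  // full 256-bit store
    }
    for (; i < n; ++i)                                // scalar tail
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
}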

Upvotes: 0

Cory Nelson

Reputation: 29981

When the CPU switches between AVX and SSE instructions, it needs to save/restore the upper parts of the ymm registers, which can result in a pretty large penalty.

Normally compiling with /arch:AVX will fix this for your own code, as the compiler will use VEX-encoded 128-bit AVX instructions instead of SSE ones where possible. However, in this case it may be that your standard library's math functions are not implemented using AVX instructions, in which case you'd get a transition penalty for every function call. You'd have to post a disassembly to be sure.

You often see VZEROUPPER being issued before such a transition, to signal that the CPU doesn't need to preserve the upper parts of the ymm registers, but the compiler isn't smart enough to know whether a function it calls needs one inserted as well.
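If you end up calling into SSE-only library code from AVX-compiled code yourself, you can insert the zeroing manually with the intrinsic. A minimal sketch, where legacy_sse_routine is just a placeholder for whatever non-VEX code you're calling:

#include <immintrin.h>
#include <cstddef>

// Placeholder for a function compiled without AVX (non-VEX SSE encodings).
void legacy_sse_routine(double* data, size_t n);

void call_legacy_from_avx(double* data, size_t n)
{
    // ... AVX (VEX-encoded) work that dirties the upper halves of the ymm registers ...

    _mm256_zeroupper();              // tell the CPU the upper 128 bits are dead,
                                     // avoiding the AVX->SSE transition penalty
    legacy_sse_routine(data, n);     // non-VEX SSE code now runs penalty-free

    // ... back to AVX code; the first ymm use reloads upper halves as needed ...
}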

Upvotes: 2
