Reputation: 131

Element-wise vector mutiplication

What would be a faster (in terms of time required by the code) alternative to the following method for multiplication of two n element integer vectors:

{
// code for obtaining two n element int vectors, a and b
}
int temp = 0;  // a temporary variable
for (int ii = 0; ii < n; ++ii)
    temp += a[ii]*b[ii];

Edit: Received several nice ideas. I would have to check each one to see which one is the best. Surely each of the responses tell me something new.

Upvotes: 0

Answers (4)

Escualo

Reputation: 42072

Just use the C++ Standard Library:

#include<algorithm>
#include<iostream>
#include<vector>

int main() {
  std::vector<double> x = {1, 2, 3};
  std::vector<double> y = {4, 5, 6};
  double xy = std::inner_product(x.begin(), x.end(), y.begin(), 0);
  std::cout<<"inner product: "<<xy<<std::endl;

  return 0;

}

Edit: added timing information.

Just out of curiosity I added the following timing information.

The source code:

#include<algorithm>
#include<iostream>
#include<vector>
#include<random>
#include<boost/timer/timer.hpp>

int main(int argc, char* argv[]) {
  // get the desired number of elements
  if(argc!=2) {
    std::cerr<<"usage: "<<argv[0]<<" N"<<std::endl;
    return EXIT_FAILURE;
  }

  int N = std::stoi(argv[1]);

  // set-up the random number generator
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_real_distribution<> dis(-100, 100);

  // prepare the vectors
  std::vector<double> x, y;

  // fill the vectors with random numbers
  auto rgen = [&dis, &gen]() { return dis(gen); };  
  std::generate_n(std::back_inserter(x), N, rgen);
  std::generate_n(std::back_inserter(y), N, rgen);

  // Heat-up the cache (try commenting-out this line and you'll see
  // that the time increases for whatever algorithm you put firts)
  double xy = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
  std::cout<<"heated-up value: "<<xy<<std::endl;

  { // start of new timing scope
    // write a message to the assembly source
    boost::timer::auto_cpu_timer t;
    asm("##### START OF ALGORITHMIC APPROACH #####");
    double xy = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
    asm("##### END OF ALGORITHMIC APPROACH #####");
    std::cout<<"algorithmic value: "<<xy<<std::endl<<"timing info: ";
  } // end of timing scope


  { // start of new timing scope
    // write a message to the assembly source
    boost::timer::auto_cpu_timer t;
    asm("##### START OF HAND-CODED APPROACH #####");
    double tmp = 0.0;
    for(int k=0; k<N; k++) {
      tmp += x[k] * y[k];
    }
    asm("##### END OF HAND-CODED APPROACH #####");
    std::cout<<"hand-coded value: "<<tmp<<std::endl<<"timing info: ";
  } // end of timing scope


  return EXIT_SUCCESS;  
}

The environment: 2.7 GHz Intel Core i7; OS X 10.7.4; gcc 4.8.1;

The compilation command: g++ -O3 inner-product-assembly.cpp -std=c++11 -lboost_timer -lboost_system

Sample runs:

[11:01:58 ~/research/c++] ./a.out 10
heated-up value: 8568.75
algorithmic value: 8568.75
timing info:  0.000006s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 8568.75
timing info:  0.000004s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:01:59 ~/research/c++] ./a.out 100
heated-up value: -13072.2
algorithmic value: -13072.2
timing info:  0.000006s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: -13072.2
timing info:  0.000004s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:03 ~/research/c++] ./a.out 1000
heated-up value: 80389.1
algorithmic value: 80389.1
timing info:  0.000010s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 80389.1
timing info:  0.000007s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:04 ~/research/c++] ./a.out 10000
heated-up value: 89753.7
algorithmic value: 89753.7
timing info:  0.000041s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 89753.7
timing info:  0.000039s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:05 ~/research/c++] ./a.out 100000
heated-up value: -461750
algorithmic value: -461750
timing info:  0.000292s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: -461750
timing info:  0.000282s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:07 ~/research/c++] ./a.out 1000000
heated-up value: 2.52643e+06
algorithmic value: 2.52643e+06
timing info:  0.002702s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
hand-coded value: 2.52643e+06
timing info:  0.002660s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)

[11:02:09 ~/research/c++] ./a.out 10000000
heated-up value: 6.04128e+06
algorithmic value: 6.04128e+06
timing info:  0.026557s wall, 0.030000s user + 0.000000s system = 0.030000s CPU (113.0%)
hand-coded value: 6.04128e+06
timing info:  0.026335s wall, 0.030000s user + 0.000000s system = 0.030000s CPU (113.9%)

[11:02:11 ~/research/c++] ./a.out 100000000
heated-up value: 2.27043e+07
algorithmic value: 2.27043e+07
timing info:  0.264547s wall, 0.270000s user + 0.000000s system = 0.270000s CPU (102.1%)
hand-coded value: 2.27043e+07
timing info:  0.264346s wall, 0.260000s user + 0.000000s system = 0.260000s CPU (98.4%)

So, the speed advantage of using the hand-coded approach seems to matter only if you use smaller arrays. Past the 10,000 mark, I would consider their run time identical, but I prefer the algorithmic approach because it is simpler to write and maintain, and it may benefit from updates to the library.

As usual, this timing info should be taken with a grain of salt.

Upvotes: 2

Peter

Reputation: 14937

The only way I can see to make this faster in C++ is to fix the "loop carried dependency", which means that each iteration has to wait until the previous temp value is available for the sum. This can be accomplished by unrolling and using several accumulation variables:

int t0=0, t1=0, t2=0, t3=0;
for (int ii = 0; ii < n; ii += 4) {
    t0 += a[ii]*b[ii];
    t1 += a[ii+1]*b[ii+1];
    t2 += a[ii+2]*b[ii+2];
    t3 += a[ii+3]*b[ii+3];
}
int temp = t0 + t1 + t2 + t3;

Modern processors can execute multiple operations per cycle, but only if there's no dependency in the way. On my system, this yields a roughly 20% improvement. Note: n has to be a multiple of 4, or you need to add a loop "epilog" that finishes the elements left over. Test and measure! I have no idea if 4 is the "correct" amount of unrolling.

You can get much more improvement by calling processor-specific SIMD intrinsics, but that's not standard C++.

Upvotes: 7

bames53

Reputation: 88155

#include <valarray>

int main() {
  std::valarray<int> a {1, 2, 3, 4, 5};
  std::valarray<int> b {3, 4, 5, 6, 7};

  auto c = (a * b).sum();
}

valarray can use SIMD and other vector instructions, although it seems to be regularly neglected in common implementations.

Upvotes: 1

Magnus

Reputation: 1640

Quick answer: That's probably the fastest solution. Long Answer: there are two definitions of fast. If you're looking a quick and uncomplicated solution the above code should suit well. If you're looking for a fast running code, I don't know if reimplementing the for-loop with a while-loop or using libraries/packages will help. Good Luck!

Upvotes: 0

Element-wise vector mutiplication

Answers (4)

Edit: added timing information.

Related Questions