Tyson

Reputation: 1238

C++ manipulation of HUGE vector using threads: can it be optimized further?

I have a very simple class that contains a single integer. The class has a single function that increments that integer by one. Basically:

class foo
{
public:
   int bar;
   foo(){bar = 1;}
   void inc() {bar++;}
};

I also have a vector that contains 500 million instances of that class (as objects, not pointers).

I have a thread pool that waits to process that vector, with 16 threads (the number of cores on my machine). When I want to process the data, I make the threads run this function (where startInx/endInx mark the portion of the vector each thread should process, i.e. the vector divided evenly into [number of threads] sections):

void threadFn(vector<foo> &arr, int startInx, int endInx)
{
    for (int i = startInx; i < endInx; i++)
    {
        foo &f = arr[i];
        f.inc();
    }
}
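
For reference, the way the pool hands out the work is roughly equivalent to this (a simplified stand-in using plain std::thread; my real pool reuses its threads):

#include <functional>
#include <thread>
#include <vector>

// Simplified stand-in for the pool: split the vector into one
// contiguous chunk per thread and run threadFn over each chunk.
void processAll(std::vector<foo> &arr, int numThreads)
{
    std::vector<std::thread> threads;
    int chunk = (int)arr.size() / numThreads;
    for (int t = 0; t < numThreads; t++)
    {
        int start = t * chunk;
        // the last thread picks up any remainder
        int end = (t == numThreads - 1) ? (int)arr.size() : start + chunk;
        threads.emplace_back(threadFn, std::ref(arr), start, end);
    }
    for (auto &th : threads)
        th.join();
}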

If I run this function on only 1 thread, it returns in ~750ms (which excludes the time it took to construct the vector). If I run this function on all 16 threads, it returns in ~100ms. So I'm getting 7.5x the speed, which is nice...but I want to know if there's anything I can do to push it further.

I've read a bit about false sharing, but am not exactly sure how to translate that problem into a practical optimization here in order to avoid it.
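
The examples I've read all look something like the sketch below (hypothetical per-thread counters, each padded to its own cache line), but I don't see how that maps onto my single shared vector:

#include <cstdint>
#include <thread>
#include <vector>

// Textbook false-sharing setup (not my actual code): each thread bumps
// its own counter. Without alignas(64), neighbouring counters would sit
// on the same 64-byte cache line and the threads would keep invalidating
// each other's copies of that line; padding each counter to its own line
// avoids that.
struct alignas(64) PaddedCounter
{
    std::int64_t value = 0;
};

void countInParallel(int numThreads, std::int64_t perThread)
{
    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; t++)
    {
        threads.emplace_back([&counters, t, perThread] {
            for (std::int64_t i = 0; i < perThread; i++)
                counters[t].value++; // each thread touches only its own line
        });
    }
    for (auto &th : threads)
        th.join();
}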

Basically I'm looking for any ideas that would help this algorithm scale better with the number of cores used. Or is that not possible?

It might be worth noting up front that my thread pool implementation is not part of the bottleneck: if I strip the loop out of threadFn, the pool finishes its work in <1ms.

Upvotes: 0

Views: 240

Answers (1)

Yakk - Adam Nevraumont

Reputation: 275510

Speed-wise, your code is going to be fine; you are almost certainly memory-I/O bound.

Design-wise, you probably want a span type rather than passing around a vector and a pair of ints.

The GSL has a span type which represents a chunk of contiguous memory by storing two pointers. Writing one yourself is easy:

#include <algorithm>   // std::min
#include <cstddef>     // std::size_t
#include <iterator>    // std::prev
#include <type_traits> // std::enable_if_t, std::is_same, std::decay_t
#include <utility>     // std::declval

template<class T>
struct span {
  T* b = nullptr;
  T* e = nullptr;
  T* begin() const { return b; }
  T* end() const { return e; }
  span( T* s, T* f ):b(s),e(f) {}
  span( T* s, std::size_t length ):span(s, s+length) {}

  span() = default;
  span(span const&) = default;
  span& operator=(span const&) = default;
  ~span() = default;

  // construct from a C array:
  template<class U, std::size_t N,
    class=std::enable_if_t< std::is_same<std::decay_t<U>, std::decay_t<T>>::value >
  >
  span( U(& arr)[N] ):span(arr, N) {}
  // construct from any contiguous container with .data() and .size():
  template<class C,
    class=decltype( std::declval<C>().data() ),
    class=std::enable_if_t< !std::is_same<std::decay_t<C>, span>::value >
  >
  span( C&& c ):span(c.data(), c.size()) {}

  // container-like operations:
  bool empty() const { return begin()==end(); }
  std::size_t size() const { return end()-begin(); }
  T& front() const { return *begin(); }
  T& back() const { return *std::prev(end()); }
  T* data() const { return begin(); }

  // create child spans:
  span without_front( std::size_t N=1 ) const {
    return { begin()+(std::min)(N, size()), end() };
  }
  span without_back( std::size_t N=1 ) const {
    return { begin(), end()-(std::min)(N, size()) };
  }
};

Now your code looks slicker:

void threadFn(span<foo> foos)
{
  for (foo& f : foos)
  {
    f.inc();
  }
}

As a bonus, data that isn't stored in a vector now works too, like a std::array or a plain C array.

span implicitly converts from any container with .data() and .size() members, such as the std containers.
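
A quick sketch of what that buys you at the call sites (using foo and threadFn from the question and the span above):

#include <array>
#include <vector>

int main()
{
    std::vector<foo> vec(1000);
    std::array<foo, 64> arr{};
    foo raw[16];

    threadFn(vec);                                 // whole vector
    threadFn(span<foo>(vec).without_front(500));   // just the second half
    threadFn(arr);                                 // std::array works
    threadFn(raw);                                 // so does a plain C array
}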


If your eventual problem involves more complex operations, and less data, you'll find you are less memory bound and more CPU bound.

That will let your speedup grow beyond 7.5x over single-threaded. Eventually, once computation dominates over memory bandwidth, you could consider moving the work to the GPU; at that point the time required to get data to and from the GPU becomes the limiting factor.

You can usually determine whether a process is CPU bound or memory bound by examining how its speed grows as you add more CPUs, assuming you have avoided contention in its implementation. Often you can even see an inflection point.

If your problem involves multiple stages, fusing them so that each piece of data goes through every stage while it is still in cache, rather than sweeping the whole vector once per stage, may eventually improve this as well.
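
A hypothetical illustration of that, re-using inc as a stand-in for a second, different stage:

// Two separate passes: every foo is pulled through the cache twice.
void twoPasses(span<foo> foos)
{
    for (foo& f : foos) f.inc();   // stage 1
    for (foo& f : foos) f.inc();   // pretend this is a different stage 2
}

// Fused: each foo is loaded once and both stages run while it is hot
// in cache, roughly halving the memory traffic.
void onePass(span<foo> foos)
{
    for (foo& f : foos)
    {
        f.inc();   // stage 1
        f.inc();   // stage 2
    }
}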

Upvotes: 2
