Matt Munson

Reputation: 3003

optimizing `std::vector operator []` (vector access) when it becomes a bottleneck

gprof says that my compute-intensive app spends 53% of its time inside std::vector <...> operator [] (unsigned long), 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code failing to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 53%. Should I try using dynamic arrays instead (the size remains constant in most cases)? Would that be likely to help with possible bottlenecks?

Actually, my preferred solution would be to solve the bottleneck and leave the vectors as is for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?

Upvotes: 7

Views: 1489

Answers (5)

Mike Dunlavey

Reputation: 40679

The vector class is well liked and provides a certain amount of convenience, at the expense of performance, which is fine when you don't particularly need performance.

If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.

EDIT: Most programs have a lot more room for speedup than you might suppose. I made a walk-through project to illustrate this. If I can summarize it really quickly, it goes like this:

  • Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).

  • The first cut showed roughly 60% of time was spent in vector operations, including indexing, appending, and removing. I replaced the vector class with a similar one from MFC, and time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)

  • Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.

  • Now roughly 60% of the time is adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists, and to pool used objects. The first reduced time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).

  • Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).

  • Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).

I could have kept going, but I stopped. The overall time per job was reduced by a compounded factor of about 700x.

What I want you to take away is that if you need performance badly enough to deviate from what might be considered the accepted ways of doing things, you don't have to stop after one "bottleneck". Just because you got a big speedup doesn't mean there are no more. In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor. So raise your expectations of the speedup you can get, and go for broke.

Upvotes: 0

Sergey Podobry

Reputation: 7189

I suspect that gprof prevents functions from being inlined. Try another profiling method. std::vector operator [] cannot be the bottleneck because it doesn't differ much from raw array access. The SGI implementation is shown below:

reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }

Upvotes: 2

Violet Giraffe

Reputation: 33589

Did you examine your memory access pattern itself? It might be inefficient, i.e. cache-unfriendly.

Upvotes: 4

6502

Reputation: 114539

You cannot trust gprof for profiling high-speed code; you should instead use a passive profiling method like oprofile to get the real picture.

As an alternative you could profile by manual code alteration (e.g. calling a computation 10 times instead of once and checking how much the execution time increases). Note, however, that this will be influenced by cache effects, so YMMV.

Upvotes: 1

Eugene

Reputation: 3417

Did you try using a raw pointer when accessing the array?

// Regular code:

for (size_t i = 0; i < arr.size(); ++i)
    wcout << arr[i];

// In the bottleneck, index through a raw pointer instead:

int *pArr = &arr.front();   // or arr.data() in C++11

for (size_t i = 0; i < arr.size(); ++i)
    wcout << pArr[i];

Upvotes: 2
