Reputation: 6606
I recently modified a program to use virtual functions (in place of a sequence of if-else conditions with static calls). The modified program runs 8% slower than the original. That seems like too high a cost for using virtual functions, so I must be doing something inefficient in the way I set up the class hierarchy and virtual functions; but I'm at a loss for how to track down the problem. (I see similar performance degradation with both clang on my Mac and gcc on Linux.)
The program is used to study different community detection algorithms; it uses a nested loop to apply a series of user-specified objective functions to a variety of (graph, partition) pairs.
Here is a rough outline of the original code:
int main(int argc, char* argv[]) {
    bool use_m1;
    bool use_m2;
    ...
    bool use_m10;
    // set the various "use" flags based on argv
    for (Graph& g : graphsToStudy()) {
        for (Partition& p : allPartitions()) {
            if (use_m1) {
                M1::evaluate(g, p);
            }
            if (use_m2) {
                M2::evaluate(g, p);
            }
            // and so on
        }
    }
}
To make the code easier to maintain, I created a class structure for the different objective functions, and iterated through an array of pointers:
class ObjectiveFunction {
public:
    virtual double eval(Graph& g, Partition& p) = 0;
};

class ObjFn1 : public ObjectiveFunction {
public:
    virtual double eval(Graph& g, Partition& p) {
        return M1::evaluate(g, p);
    }
};

class ObjFn2 : public ObjectiveFunction {
public:
    virtual double eval(Graph& g, Partition& p) {
        return M2::evaluate(g, p);
    }
};
int main(int argc, char* argv[]) {
    vector<ObjectiveFunction*> funcs;
    fill_funcs_based_on_opts(funcs, argc, argv);
    for (Graph& g : graphsToStudy()) {
        for (Partition& p : allPartitions()) {
            // funcs contains one object for each function selected by the user.
            for (ObjectiveFunction* fp : funcs) {
                fp->eval(g, p);
            }
        }
    }
}
Given that generating the graphs and partitions, as well as evaluating the objective functions, is moderately computationally intensive, the added cost of the virtual function call should be almost unnoticeable. Any ideas what I may have done wrong, or how to track it down? I tried using callgrind, but I'm not seeing any insights.
Perhaps I am just incorrectly interpreting the output of callgrind_annotate. In the example below, Neo::Context::evaluatePartition is analogous to ObjFn1::eval in the example above.
1. Why is this function listed four different times with different source files? This method is only ever called from function main in timeMetrics.cpp.
2. What does src/lib/PartitionIterator.h:main refer to? There is no main function in PartitionIterator.h.
3. Why does 414,219,420 appear twice in the source code listing for evaluatePartition? Isn't the first number supposed to represent the overhead of the function call?
35,139,513,913  PROGRAM TOTALS

17,029,020,600  src/lib/metrics/Neo.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool) [bin/timeMetrics_v]
 7,168,741,865  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
 4,418,473,884  src/lib/Partition.h:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList() [bin/timeMetrics_v]
 1,459,239,657  src/lib/PartitionIterator.h:main
 1,288,682,640  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
 1,058,560,740  src/lib/Partition.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
 1,012,736,608  src/perfEval/timeMetrics.cpp:main [bin/timeMetrics_v]
   443,847,782  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:main
   368,372,912  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
   322,170,738  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:main
    92,048,760  src/lib/SmallGraph.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
    84,549,144  ???:szone_free_definite_size [/usr/lib/system/libsystem_malloc.dylib]
    54,212,938  ???:tiny_free_list_add_ptr [/usr/lib/system/libsystem_malloc.dylib]
. virtual double
414,219,420 evaluatePartition(const Partition <VertexID, SetBitmap> &p, bool raw = false) {
414,219,420 uint_wn_t raw_answer = Neo::evaluatePartition(*(this->g), p);
. return (double) (raw ? raw_answer : max_neo - raw_answer);
. }
. }; // end Context
Upvotes: 2
Views: 389
Reputation: 6358
Where I can, I prefer static over dynamic dispatch. Dynamic dispatch can cost you by preventing optimizations such as function inlining, and the double dereference involved with vtables can give you poor locality (instruction cache misses).
I suspect the lion's share of the performance difference comes from losing the optimizations that static dispatch permits, inlining in particular. It might be interesting to disable inlining in the original code to see how much of a benefit you were enjoying.
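A minimal sketch of that experiment, using hypothetical stand-in names rather than the question's code: time a statically dispatched call with inlining forbidden, then delete the attribute and time it again; the gap approximates the inlining benefit. (GCC and Clang also accept -fno-inline to disable inlining program-wide.)

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for one objective function.
// __attribute__((noinline)) is GCC/Clang syntax; MSVC uses __declspec(noinline).
struct M1 {
    __attribute__((noinline))
    static double evaluate(double x) { return x * 1.0000001 + 1.0; }
};

int main() {
    auto t0 = std::chrono::steady_clock::now();
    double acc = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        acc = M1::evaluate(acc);   // static call, but never inlined
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    // Print acc so the compiler cannot discard the loop entirely.
    std::printf("acc = %g, %lld ms\n", acc, (long long)ms);
}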
Upvotes: 0
Reputation: 283981
It's expensive because you're actually using polymorphism: one indirect call site dispatches to many different targets, which defeats the branch predictor.
It may help the branch predictor if you replace iteration over a collection with an intrusive linked list:
class ObjectiveFunction
{
    ObjectiveFunction* m_next;
protected:
    ObjectiveFunction(ObjectiveFunction* next = nullptr) : m_next(next) {}

    // for gcc/clang add __attribute__((always_inline)),
    // for MSVC use __forceinline
    void call_next(Graph& g, Partition& p)
    {
        if (m_next) m_next->eval(g, p);
    }
public:
    virtual double eval(Graph& g, Partition& p) = 0;
};
Now, instead of one call site inside the loop reaching many different functions, the call_next() function (which should be the last step of each individual eval override) gets inlined into each of those overrides. At runtime, each inlined copy of that indirect-call instruction repeatedly calls just one function, resulting in effectively 100% branch prediction.
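For concreteness, here is a sketch of how the chain might be wired up and driven (ObjFn1, ObjFn2, M1, and M2 follow the question's naming; the chain-building itself is my assumption, not part of the answer):

// Each concrete node ends its eval() with call_next(), so the indirect
// call to the next node is inlined into this override.
class ObjFn1 : public ObjectiveFunction {
public:
    explicit ObjFn1(ObjectiveFunction* next = nullptr) : ObjectiveFunction(next) {}
    double eval(Graph& g, Partition& p) override {
        double r = M1::evaluate(g, p);
        call_next(g, p);   // last step: one call site, one target
        return r;
    }
};

class ObjFn2 : public ObjectiveFunction {
public:
    explicit ObjFn2(ObjectiveFunction* next = nullptr) : ObjectiveFunction(next) {}
    double eval(Graph& g, Partition& p) override {
        double r = M2::evaluate(g, p);
        call_next(g, p);
        return r;
    }
};

// Build the chain once, then make a single virtual call per (g, p) pair:
//   ObjFn2 f2;           // tail
//   ObjFn1 f1(&f2);      // head: f1 -> f2
//   for (Graph& g : graphsToStudy())
//       for (Partition& p : allPartitions())
//           f1.eval(g, p);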
Upvotes: 1
Reputation: 264749
Let's fix the obvious first:
In both versions you do this:
for (Graph g : graphsToStudy()) {
    for (Partition p : allPartitions()) {
Unless Graph and Partition are small and cheap to copy, most of your work will be spent here. Take them by reference instead:
for (Graph& g : graphsToStudy()) {
    //    ^
    for (Partition& p : allPartitions()) {
        //        ^
My second question: this does not seem like the correct use of virtual functions. Your original code looks totally fine for this use case, where multiple versions of evaluate() are called on each (g, p) pair.
Now, if you always called every one of the evaluate() functions, it might be a better use case, but then you would no longer need that inner loop:
for (ObjectiveFunction* fp : funcs) {
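In that case, a rough sketch of the flattened version (reusing the question's Graph, Partition, and M1/M2 names) would call every metric unconditionally, with static dispatch the compiler can inline:

for (Graph& g : graphsToStudy()) {
    for (Partition& p : allPartitions()) {
        M1::evaluate(g, p);
        M2::evaluate(g, p);
        // ... and so on through M10
    }
}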
Upvotes: 2