Zack

Reputation: 6606

Why is this virtual function call so expensive?

I recently modified a program to use virtual functions (in place of a sequence of if-else conditions making static calls). The modified program runs 8% slower than the original. That seems like too high a cost for using virtual functions, so I must be doing something inefficient in the way I set up the class hierarchy and virtual functions; but I'm at a loss for how to track down the problem. (I see similar performance degradation with both clang on my Mac and gcc on Linux.)

The program is used to study different community detection algorithms. The program uses a nested loop to apply a series of user-specified objective functions to a variety of (graph, partition) pairs.

Here is a rough outline of the original code:

int main(int argc, char* argv[]) {
    bool use_m1;
    bool use_m2;
    ...
    bool use_m10;

    //  set the various "use" flags based on argv

    for (Graph& g : graphsToStudy()) {
        for (Partition& p : allPartitions()) {
            if (use_m1) {
                M1::evaluate(g, p);
            }
            if (use_m2) {
                M2::evaluate(g, p);
            }
            // and so on
        }
    }
}

To make the code easier to maintain, I created a class structure for the different objective functions, and iterated through an array of pointers:

class ObjectiveFunction {
public:
    virtual double eval(Graph& g, Partition& p) = 0;
};

class ObjFn1 : public ObjectiveFunction {
public:
    double eval(Graph& g, Partition& p) override {
        return M1::evaluate(g, p);
    }
};

class ObjFn2 : public ObjectiveFunction {
public:
    double eval(Graph& g, Partition& p) override {
        return M2::evaluate(g, p);
    }
};


int main(int argc, char* argv[]) {
    vector<ObjectiveFunction*> funcs;
    fill_funcs_based_on_opts(funcs, argc, argv);

    for (Graph& g : graphsToStudy()) {
        for (Partition& p : allPartitions()) {
            // funcs contains one object for each function selected by user.
            for (ObjectiveFunction* fp : funcs) {
                fp->eval(g, p);
            }
        }
    }
}

Given that generating the graphs and partitions, as well as evaluating the objective functions themselves, is moderately computationally intensive, the addition of a virtual function call should be almost unnoticeable. Any ideas what I may have done wrong, or how to track it down? I tried using callgrind, but am not seeing any insights.

Perhaps I am just incorrectly interpreting the output of callgrind_annotate. In the example below, Neo::Context::evaluatePartition is analogous to ObjFn1::eval in the example above.

  1. Why is this function listed four different times with different source files? This method is only ever called from function main in timeMetrics.cpp.

  2. What does src/lib/PartitionIterator.h:main refer to? There is no main function in PartitionIterator.h.

  3. Why does 414,219,420 appear twice in the source code listing for evaluatePartition? Isn't the first number supposed to represent the overhead of the function call?


35,139,513,913  PROGRAM TOTALS
17,029,020,600  src/lib/metrics/Neo.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)  [bin/timeMetrics_v]
 7,168,741,865  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
 4,418,473,884  src/lib/Partition.h:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList() [bin/timeMetrics_v]
 1,459,239,657  src/lib/PartitionIterator.h:main
 1,288,682,640  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
 1,058,560,740  src/lib/Partition.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
 1,012,736,608  src/perfEval/timeMetrics.cpp:main [bin/timeMetrics_v]
   443,847,782  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:main
   368,372,912  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
   322,170,738  /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:main
    92,048,760  src/lib/SmallGraph.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
    84,549,144  ???:szone_free_definite_size [/usr/lib/system/libsystem_malloc.dylib]
    54,212,938  ???:tiny_free_list_add_ptr [/usr/lib/system/libsystem_malloc.dylib]



            .          virtual double
  414,219,420          evaluatePartition(const Partition <VertexID, SetBitmap> &p, bool raw = false) {
  414,219,420            uint_wn_t raw_answer = Neo::evaluatePartition(*(this->g), p);
            .            return (double) (raw ? raw_answer : max_neo - raw_answer);
            .          }
            .        }; // end Context

Upvotes: 2

Views: 389

Answers (3)

U007D

Reputation: 6358

Where I can, I prefer static over dynamic dispatch -- dynamic dispatch can cost you by preventing optimizations like function inlining, and with the double-dereference involved with vtables you can suffer from poor locality (instruction cache misses).

I suspect the lion's share of the performance difference comes from losing the optimizations that static dispatch enables. It might be interesting to disable inlining in the original code to see how much of a benefit you were enjoying.
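To make the comparison concrete, here is a minimal harness you could put under a profiler; it is a sketch, not the asker's real code. Graph, Partition, and M1 are placeholder stand-ins, and the noinline attribute (gcc/clang syntax) on M1::evaluate simulates losing the inlining benefit that static dispatch would otherwise allow:

```cpp
#include <cassert>

// Placeholder stand-ins for the asker's types; not the real program.
struct Graph {};
struct Partition {};

struct M1 {
    // noinline approximates the cost of a call that cannot be inlined,
    // which is one of the penalties dynamic dispatch imposes.
    __attribute__((noinline))
    static double evaluate(Graph&, Partition&) { return 1.0; }
};

// Static dispatch: the compiler sees the exact target and can
// normally inline and optimize across the call.
double eval_static(Graph& g, Partition& p) {
    return M1::evaluate(g, p);
}

struct ObjectiveFunction {
    virtual ~ObjectiveFunction() = default;
    virtual double eval(Graph& g, Partition& p) = 0;
};

struct ObjFn1 : ObjectiveFunction {
    double eval(Graph& g, Partition& p) override { return M1::evaluate(g, p); }
};

// Dynamic dispatch: the call goes through the vtable, so the compiler
// usually cannot inline it or optimize across it.
double eval_dynamic(ObjectiveFunction* fp, Graph& g, Partition& p) {
    return fp->eval(g, p);
}
```

Timing a tight loop over each of these two entry points (with optimizations on) gives a rough estimate of how much of the 8% is dispatch overhead versus lost inlining.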

Upvotes: 0

Ben Voigt

Reputation: 283981

It's expensive because you're actually using polymorphism, which defeats the branch predictor.

It may help the branch predictor if you replace collection iteration with an intrinsic linked list:

class ObjectiveFunction
{
    ObjectiveFunction* m_next;

  protected:
    ObjectiveFunction(ObjectiveFunction* next = nullptr) : m_next(next) {}

    // for gcc use __attribute__((always_inline))
    // for MSVC use __forceinline
    void call_next(Graph& g, Partition& p)
    {
        if (m_next) m_next->eval(g, p);
    }
  public:
    virtual void eval(Graph& g, Partition& p) = 0;
    virtual ~ObjectiveFunction() = default;
};

Now, instead of one line of code inside the loop reaching many different functions, the call_next() function (which should be the last step of each individual eval override) is inlined into each of those overrides. At runtime, each inlined copy of that indirect-call instruction repeatedly calls just one function, resulting in essentially 100% branch prediction.
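A derived class in this scheme might look like the following self-contained sketch. Graph, Partition, M1, and M2 are placeholder stand-ins for the asker's types, and the global total accumulator is hypothetical, added only so the chain has an observable effect:

```cpp
#include <cassert>

// Placeholder stand-ins for the asker's types and metrics.
struct Graph {};
struct Partition {};
struct M1 { static double evaluate(Graph&, Partition&) { return 1.0; } };
struct M2 { static double evaluate(Graph&, Partition&) { return 2.0; } };

static double total = 0.0;  // hypothetical accumulator for the demo

class ObjectiveFunction {
    ObjectiveFunction* m_next;
  protected:
    explicit ObjectiveFunction(ObjectiveFunction* next = nullptr) : m_next(next) {}
    // Each inlined copy of this indirect call targets one function
    // at runtime, which is what helps the branch predictor.
    void call_next(Graph& g, Partition& p) {
        if (m_next) m_next->eval(g, p);
    }
  public:
    virtual void eval(Graph& g, Partition& p) = 0;
    virtual ~ObjectiveFunction() = default;
};

class ObjFn1 : public ObjectiveFunction {
  public:
    explicit ObjFn1(ObjectiveFunction* next = nullptr) : ObjectiveFunction(next) {}
    void eval(Graph& g, Partition& p) override {
        total += M1::evaluate(g, p);
        call_next(g, p);  // last step: hand off to the next node
    }
};

class ObjFn2 : public ObjectiveFunction {
  public:
    explicit ObjFn2(ObjectiveFunction* next = nullptr) : ObjectiveFunction(next) {}
    void eval(Graph& g, Partition& p) override {
        total += M2::evaluate(g, p);
        call_next(g, p);
    }
};

double run_chain(Graph& g, Partition& p) {
    total = 0.0;
    ObjFn2 f2;           // end of the chain
    ObjFn1 f1(&f2);      // f1 -> f2
    f1.eval(g, p);       // one virtual call fans out through the chain
    return total;
}
```

The chain is built once from the user's selected options; the inner loop over the funcs vector is then replaced by the single call at the head of the chain.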

Upvotes: 1

Loki Astari

Reputation: 264749

Let's fix the obvious first:

In both versions you do this:

foreach (Graph g : graphsToStudy()) {
    foreach (Partition p : allPartitions()) {

Unless Graph and Partition are small and cheap to copy, most of your time will be spent here.

foreach (Graph& g : graphsToStudy()) {
           // ^
    foreach (Partition& p : allPartitions()) {
                   // ^

My second question: this does not seem like the correct use of virtual functions. Your original code looks totally fine for this use case, where multiple versions of evaluate() are being called on each (g, p) object pair.

Now, if you always called every one of the evaluate() functions, then it might be a better use case, but then you no longer need that inner loop:

 foreach (ObjectiveFunction* fp : funcs) {
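For that always-run-everything case, the inner loop and the class hierarchy collapse into a sequence of static calls, as in this sketch (Graph, Partition, M1, and M2 are placeholder stand-ins for the asker's types):

```cpp
#include <cassert>

// Placeholder stand-ins for the asker's types and metrics.
struct Graph {};
struct Partition {};
struct M1 { static double evaluate(Graph&, Partition&) { return 1.0; } };
struct M2 { static double evaluate(Graph&, Partition&) { return 2.0; } };

// If every metric always runs, no polymorphism or inner loop is
// needed: each call is statically dispatched and freely inlinable.
double evaluate_all(Graph& g, Partition& p) {
    double sum = 0.0;
    sum += M1::evaluate(g, p);
    sum += M2::evaluate(g, p);
    // ... and so on for each metric
    return sum;
}
```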

Upvotes: 2
