codablank1
codablank1

Reputation: 6175

C++ : difference of execution time between two call of a virtual function

Consider this code under gcc 4.5.1 (Ubuntu 10.04, intel core2duo 3.0 Ghz) It's just 2 tests, in the first one I make a direct call on virtual fucnion and in the second I call it via a Wrapper class :

test.cpp

#define ITER 100000000

class Print{

public:

typedef Print* Ptr;

virtual void print(int p1, float p2, float p3, float p4){/*DOES NOTHING */}

};

class PrintWrapper
{

    public:

      typedef PrintWrapper* Ptr;

      PrintWrapper(Print::Ptr print, int p1, float p2, float p3, float p4) :
      m_print(print), _p1(p1),_p2(p2),_p3(p3),_p4(p4){}

      ~PrintWrapper(){}

      void execute()
      { 
        m_print->print(_p1,_p2,_p3,_p4); 
      }

    private:

      Print::Ptr m_print;
      int _p1;
      float _p2,_p3,_p4;

};

 Print::Ptr p = new Print();
 PrintWrapper::Ptr pw = new PrintWrapper(p, 1, 2.f,3.0f,4.0f);

void test1()
{

 //-------------test 1-------------------------

 for (auto var = 0; var < ITER; ++var) 
 {
   p->print(1, 2.f,3.0f,4.0f);
 }

 }

 void test2()
 {

  //-------------test 2-------------------------

 for (auto var = 0; var < ITER; ++var) 
 {
   pw->execute();
 }

}

int main() 
{ 
  test1(); 
  test2();
}

I profiled it with gprof and objdump :

g++ -c -std=c++0x -pg -g -O2 test.cpp
objdump -d -M intel -S test.o > objdump.txt
g++ -pg test.o -o test
./test
gprof test > gprof.output

in gprof.output I observed that test2() takes more time than test1() but I can't explain it

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 49.40      0.41     0.41        1   410.00   540.00  test2()
 31.33      0.67     0.26 200000000     0.00     0.00  Print::print(int, float, float, float)
 19.28      0.83     0.16        1   160.00   290.00  test1()
  0.00      0.83     0.00        1     0.00     0.00  global constructors keyed to p

The assembly code in objdump.txt doesn't help me either :

 //-------------test 1-------------------------
 for (auto var = 0; var < ITER; ++var) 
  15:   83 c3 01                add    ebx,0x1
 {
   p->print(1, 2.f,3.0f,4.0f);
  18:   8b 10                   mov    edx,DWORD PTR [eax]
  1a:   c7 44 24 10 00 00 80    mov    DWORD PTR [esp+0x10],0x40800000
  21:   40 
  22:   c7 44 24 0c 00 00 40    mov    DWORD PTR [esp+0xc],0x40400000
  29:   40 
  2a:   c7 44 24 08 00 00 00    mov    DWORD PTR [esp+0x8],0x40000000
  31:   40 
  32:   c7 44 24 04 01 00 00    mov    DWORD PTR [esp+0x4],0x1
  39:   00 
  3a:   89 04 24                mov    DWORD PTR [esp],eax
  3d:   ff 12                   call   DWORD PTR [edx]

  //-------------test 2-------------------------
 for (auto var = 0; var < ITER; ++var) 
  65:   83 c3 01                add    ebx,0x1

      ~PrintWrapper(){}

      void execute()
      { 
        m_print->print(_p1,_p2,_p3,_p4); 
  68:   8b 10                   mov    edx,DWORD PTR [eax]
  6a:   8b 70 10                mov    esi,DWORD PTR [eax+0x10]
  6d:   8b 0a                   mov    ecx,DWORD PTR [edx]
  6f:   89 74 24 10             mov    DWORD PTR [esp+0x10],esi
  73:   8b 70 0c                mov    esi,DWORD PTR [eax+0xc]
  76:   89 74 24 0c             mov    DWORD PTR [esp+0xc],esi
  7a:   8b 70 08                mov    esi,DWORD PTR [eax+0x8]
  7d:   89 74 24 08             mov    DWORD PTR [esp+0x8],esi
  81:   8b 40 04                mov    eax,DWORD PTR [eax+0x4]
  84:   89 14 24                mov    DWORD PTR [esp],edx
  87:   89 44 24 04             mov    DWORD PTR [esp+0x4],eax
  8b:   ff 11                   call   DWORD PTR [ecx]

How can we explain such a difference ?

Upvotes: 6

Views: 361

Answers (4)

bdonlan
bdonlan

Reputation: 231123

In test2(), the program must first load pw from the heap, then call pw->execute() (which incurs call overhead), then load pw->m_print as well as the _p1 through _p4 arguments, then load the vtable pointer for pw, then load the vtable slot for pw->Print, then call pw->Print. Because the compiler can't see through the virtual call, it then must assume all of these values have changed for the next iteration, and reload them all.

In test(), the arguments are inline in the code segment, and we only have to load p, the vtable pointer, and the vtable slot. We've saved five loads this way. This could easily account for the time difference.

In short - the loads of pw->m_print and pw->_p1 through pw->_p4 are the culprit here.

Upvotes: 3

Mike Dunlavey
Mike Dunlavey

Reputation: 40659

Are you actually printing, or just calling a function called Print that does nothing? If you're actually printing, you're weighing the hair on the hog.

Regardless, gprof is blind to I/O, so it's only looking at your CPU usage.

Notice, Test2 does 11 moves before the call, while Test1 does only 6. So if more PC samples land in Test2, that's not surprising.

Upvotes: 1

Nathan Monteleone
Nathan Monteleone

Reputation: 5470

One difference is that the values you're passing into print in test1 are going to be stored in the instructions themselves, whereas the stuff in PrintWrapper have to be loaded from the heap. You can see this going on in the assembler. Could be running into different memory access times for that reason.

Upvotes: 2

thiton
thiton

Reputation: 36049

In the direct call, the compiler can optimize away the virtuality of the function because the type of p is known at compile time (because the only assignment to p is visible). In PrintWrapper, the type is erased and the virtual function call must be performed.

Upvotes: 1

Related Questions