Reputation: 6175
Consider this code under gcc 4.5.1 (Ubuntu 10.04, intel core2duo 3.0 Ghz) It's just 2 tests, in the first one I make a direct call on virtual fucnion and in the second I call it via a Wrapper class :
test.cpp
#define ITER 100000000
class Print{
public:
typedef Print* Ptr;
virtual void print(int p1, float p2, float p3, float p4){/*DOES NOTHING */}
};
class PrintWrapper
{
public:
typedef PrintWrapper* Ptr;
PrintWrapper(Print::Ptr print, int p1, float p2, float p3, float p4) :
m_print(print), _p1(p1),_p2(p2),_p3(p3),_p4(p4){}
~PrintWrapper(){}
void execute()
{
m_print->print(_p1,_p2,_p3,_p4);
}
private:
Print::Ptr m_print;
int _p1;
float _p2,_p3,_p4;
};
Print::Ptr p = new Print();
PrintWrapper::Ptr pw = new PrintWrapper(p, 1, 2.f,3.0f,4.0f);
void test1()
{
//-------------test 1-------------------------
for (auto var = 0; var < ITER; ++var)
{
p->print(1, 2.f,3.0f,4.0f);
}
}
void test2()
{
//-------------test 2-------------------------
for (auto var = 0; var < ITER; ++var)
{
pw->execute();
}
}
int main()
{
test1();
test2();
}
I profiled it with gprof and objdump :
g++ -c -std=c++0x -pg -g -O2 test.cpp
objdump -d -M intel -S test.o > objdump.txt
g++ -pg test.o -o test
./test
gprof test > gprof.output
in gprof.output I observed that test2() takes more time than test1() but I can't explain it
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
49.40 0.41 0.41 1 410.00 540.00 test2()
31.33 0.67 0.26 200000000 0.00 0.00 Print::print(int, float, float, float)
19.28 0.83 0.16 1 160.00 290.00 test1()
0.00 0.83 0.00 1 0.00 0.00 global constructors keyed to p
The assembly code in objdump.txt doesn't help me either :
//-------------test 1-------------------------
for (auto var = 0; var < ITER; ++var)
15: 83 c3 01 add ebx,0x1
{
p->print(1, 2.f,3.0f,4.0f);
18: 8b 10 mov edx,DWORD PTR [eax]
1a: c7 44 24 10 00 00 80 mov DWORD PTR [esp+0x10],0x40800000
21: 40
22: c7 44 24 0c 00 00 40 mov DWORD PTR [esp+0xc],0x40400000
29: 40
2a: c7 44 24 08 00 00 00 mov DWORD PTR [esp+0x8],0x40000000
31: 40
32: c7 44 24 04 01 00 00 mov DWORD PTR [esp+0x4],0x1
39: 00
3a: 89 04 24 mov DWORD PTR [esp],eax
3d: ff 12 call DWORD PTR [edx]
//-------------test 2-------------------------
for (auto var = 0; var < ITER; ++var)
65: 83 c3 01 add ebx,0x1
~PrintWrapper(){}
void execute()
{
m_print->print(_p1,_p2,_p3,_p4);
68: 8b 10 mov edx,DWORD PTR [eax]
6a: 8b 70 10 mov esi,DWORD PTR [eax+0x10]
6d: 8b 0a mov ecx,DWORD PTR [edx]
6f: 89 74 24 10 mov DWORD PTR [esp+0x10],esi
73: 8b 70 0c mov esi,DWORD PTR [eax+0xc]
76: 89 74 24 0c mov DWORD PTR [esp+0xc],esi
7a: 8b 70 08 mov esi,DWORD PTR [eax+0x8]
7d: 89 74 24 08 mov DWORD PTR [esp+0x8],esi
81: 8b 40 04 mov eax,DWORD PTR [eax+0x4]
84: 89 14 24 mov DWORD PTR [esp],edx
87: 89 44 24 04 mov DWORD PTR [esp+0x4],eax
8b: ff 11 call DWORD PTR [ecx]
How can we explain such a difference ?
Upvotes: 6
Views: 361
Reputation: 231123
In test2()
, the program must first load pw
from the heap, then call pw->execute()
(which incurs call overhead), then load pw->m_print
as well as the _p1
through _p4
arguments, then load the vtable pointer for pw
, then load the vtable slot for pw->Print
, then call pw->Print
. Because the compiler can't see through the virtual call, it then must assume all of these values have changed for the next iteration, and reload them all.
In test()
, the arguments are inline in the code segment, and we only have to load p
, the vtable pointer, and the vtable slot. We've saved five loads this way. This could easily account for the time difference.
In short - the loads of pw->m_print
and pw->_p1
through pw->_p4
are the culprit here.
Upvotes: 3
Reputation: 40659
Are you actually printing, or just calling a function called Print that does nothing? If you're actually printing, you're weighing the hair on the hog.
Regardless, gprof is blind to I/O, so it's only looking at your CPU usage.
Notice, Test2 does 11 moves before the call, while Test1 does only 6. So if more PC samples land in Test2, that's not surprising.
Upvotes: 1
Reputation: 5470
One difference is that the values you're passing into print in test1 are going to be stored in the instructions themselves, whereas the stuff in PrintWrapper have to be loaded from the heap. You can see this going on in the assembler. Could be running into different memory access times for that reason.
Upvotes: 2
Reputation: 36049
In the direct call, the compiler can optimize away the virtuality of the function because the type of p
is known at compile time (because the only assignment to p
is visible). In PrintWrapper
, the type is erased and the virtual function call must be performed.
Upvotes: 1