Reputation: 1678
As far as my knowledge of resource management goes, allocating something on the heap (operator new) should always be slower than allocating on the stack (automatic storage), because the stack is a LIFO structure: it requires minimal bookkeeping, and the address of the next allocation is trivial to compute.
So far, so good. Now look at the following code:
/* ...includes... */
using std::cout;
using std::cin;
using std::endl;
int bar() { return 42; }
int main()
{
    auto s1 = std::chrono::steady_clock::now();
    std::packaged_task<int()> pt1(bar);
    auto e1 = std::chrono::steady_clock::now();
    auto s2 = std::chrono::steady_clock::now();
    auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
    auto e2 = std::chrono::steady_clock::now();
    auto first = std::chrono::duration_cast<std::chrono::nanoseconds>(e1-s1);
    auto second = std::chrono::duration_cast<std::chrono::nanoseconds>(e2-s2);
    cout << "Regular: " << first.count() << endl
         << "Make shared: " << second.count() << endl;
    pt1();
    (*sh_ptr1)();
    cout << "As you can see, both are working correctly: "
         << pt1.get_future().get() << " & "
         << sh_ptr1->get_future().get() << endl;
    return 0;
}
The results seem to contradict the stuff explained above:
Regular: 6131
Make shared: 843
As you can see, both are working correctly: 42 & 42
Program ended with exit code: 0
In the second measurement, apart from the call to operator new, the constructor of the std::shared_ptr (auto sh_ptr1) also has to run. I can't seem to understand why this is faster than the plain construction.
What is the explanation for this?
Upvotes: 19
Views: 1762
Reputation: 157484
The problem is that the first call to the constructor of std::packaged_task is responsible for initializing a load of per-thread state that is then unfairly attributed to pt1. This is a common problem in benchmarking (particularly microbenchmarking) and is alleviated by warmup; try reading How do I write a correct micro-benchmark in Java?
If I copy your code but run both parts first, the results are the same to within the resolution of the system clock. This demonstrates another issue with microbenchmarking: small tests should be run many times so that the total time can be measured accurately.
With warmup and running each part 1000 times, I get the following (example):
Regular: 132.986
Make shared: 211.889
The difference (approx 80ns) accords well with the rule of thumb that malloc takes 100ns per call.
Upvotes: 29
Reputation: 10007
I've tried your example at ideone and got a result similar to yours:
Regular: 67950
Make shared: 696
Then I reversed the order of tests:
auto s2 = std::chrono::steady_clock::now();
auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
auto e2 = std::chrono::steady_clock::now();
auto s1 = std::chrono::steady_clock::now();
std::packaged_task<int()> pt1(bar);
auto e1 = std::chrono::steady_clock::now();
and found an opposite result:
Regular: 548
Make shared: 68065
So it's not a difference of stack vs heap, but a difference between the first and second call. Maybe you need to look into the internals of std::packaged_task.
Upvotes: 5
Reputation: 727047
It is a problem with your micro-benchmark: if you swap the order in which you measure the timing, you would get opposite results (demo).
It looks like the first call to the std::packaged_task constructor incurs a large one-time cost. Adding an untimed
std::packaged_task<int()> ignore(bar);
before measuring the time fixes this problem (demo):
Regular: 505
Make shared: 937
Upvotes: 8