Yves

Reputation: 12371

How to interpret the report of perf

I'm learning how to use perf to profile my C++ project. Here is my code:

#include <iostream>
#include <memory>   // std::unique_ptr
#include <thread>
#include <mutex>
#include <vector>


std::mutex mtx;
long long_val = 0;

void do_something(long &val)
{
    std::unique_lock<std::mutex> lck(mtx);
    for(int j=0; j<1000; ++j)
        val++;
}


void thread_func()
{
    for(int i=0; i<1000000L; ++i)
    {
        do_something(long_val);
    }
}


int main(int argc, char* argv[])
{
    std::vector<std::unique_ptr<std::thread>> threads;
    for(int i=0; i<100; ++i)
    {
        threads.push_back(std::move(std::unique_ptr<std::thread>(new std::thread(thread_func))));
    }
    for(int i=0; i<100; ++i)
    {
        threads[i]->join();
    }
    threads.clear();
    std::cout << long_val << std::endl;
    return 0;
}

To compile it, I run g++ -std=c++11 main.cpp -lpthread -g, which produces an executable named a.out.

Then I run perf record --call-graph dwarf -- ./a.out, wait about 10 seconds, and press Ctrl+C to interrupt ./a.out because it takes too long to finish.

Lastly, I run perf report -g graph --no-children and here is the output:

[Screenshot: perf report output for the unoptimized build]

My goal is to find which part of the code is the heaviest. The output seems to tell me that do_something is the heaviest part (46.25%). But when I expand do_something, I cannot make sense of the entries under it: std::_Bind_simple, std::thread::_Impl, etc.

So how can I get more useful information from the output of perf report? Or is there nothing more to learn beyond the fact that do_something is the heaviest?

Upvotes: 1

Views: 1932

Answers (2)

Yves

Reputation: 12371

With the help of @Peter Cordes, I'm posting this answer. If you have something more useful to add, please feel free to post your own answer.

You forgot to enable optimization at all when you compiled, so all the little functions that should normally inline away are actually getting called. Add -O3 or at least -O2 to your g++ command line. Optionally also profile-guided optimization if you really want gcc to do a good job on hot loops.

After adding -O3, the output of perf report becomes:

[Screenshot: perf report output after compiling with -O3]

Now futex_wake and futex_wait_setup show up in the report, which is useful: on Linux, the C++11 std::mutex is built on the kernel's futex mechanism, so time spent in these functions is time spent contending for the lock. The conclusion is that the mutex is the hotspot in this code.
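To illustrate why the lock dominates, here is a minimal sketch, assuming only the final sum matters: each thread counts into a local variable and takes the mutex once, when it publishes its result, instead of once per call. The names total_mtx, total, worker_batched and run_batched are hypothetical and are not part of the code in the question.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex total_mtx;   // protects total (hypothetical names)
long total = 0;

// Hypothetical variant of thread_func: do the counting on a local
// variable and lock only once, to add the local result to the total.
void worker_batched(long iterations)
{
    long local = 0;
    for (long i = 0; i < iterations; ++i)
        for (int j = 0; j < 1000; ++j)
            ++local;
    std::lock_guard<std::mutex> lck(total_mtx);
    total += local;
}

// Starts n_threads workers, waits for all of them, and returns the sum.
long run_batched(int n_threads, long iterations)
{
    total = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i)
        threads.emplace_back(worker_batched, iterations);
    for (auto &t : threads)
        t.join();
    return total;   // safe to read: join() synchronizes with each worker
}
```

Calling run_batched(100, 1000000) would do the same amount of counting as the original program while locking 100 times in total. Profiling such a variant with perf should show far less time in the futex functions, though the exact numbers depend on the machine.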

Upvotes: 1

doron

Reputation: 28872

The issue here is that your threads are all waiting on the same mutex, which forces your program to hit the scheduler often.

You would get better performance if you used fewer threads.
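As a rough sketch of what "fewer threads" could look like, assuming it makes sense to cap the thread count at the number of hardware threads: the helper below is hypothetical (pick_thread_count is not from the question) and only chooses a count; the threads would still be created as in the original main.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: cap the requested thread count at the number of
// hardware threads. hardware_concurrency() may return 0 when the value
// cannot be determined, so fall back to 1 in that case.
unsigned pick_thread_count(unsigned requested)
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        hw = 1;
    return std::min(requested, hw);
}
```

For example, the loop bound 100 in main could be replaced by pick_thread_count(100). Since every increment happens under one global mutex, the work is serialized regardless of the thread count, so extra threads mostly add contention rather than throughput.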

Upvotes: 0
