How to count number of executed instructions of a process id including all future child threads

Question

Some times ago, I asked the following question "How to count number of executed instructions of a process id including child processes", and @M-Iduoad kindly provided a solution with pgrep to capture all child PIDs and use it with -p in perf stat. It works great!

However, one problem I encountered is with multi-threaded application, and when a new thread is being spawned. Since I'm not a fortune teller (too bad!), I don't know tid of the newly generated threads, and therefore I can't add them in the perf stat's -p or -t parameter.

As an example, let's assume I have a multithreaded nodejs server (deployed as a container on top of Kubernetes) with the following pstree:

root@node2:/home/m# pstree -p 4037791
node(4037791)─┬─sh(4037824)───node(4037825)─┬─{node}(4037826)
              │                             ├─{node}(4037827)
              │                             ├─{node}(4037828)
              │                             ├─{node}(4037829)
              │                             ├─{node}(4037830)
              │                             └─{node}(4037831)
              ├─{node}(4037805)
              ├─{node}(4037806)
              ├─{node}(4037807)
              ├─{node}(4037808)
              ├─{node}(4037809)
              ├─{node}(4037810)
              ├─{node}(4037811)
              ├─{node}(4037812)
              ├─{node}(4037813)
              └─{node}(4037814)

Of course, I can have the following perf stat command to watch its threads:

perf stat --per-thread -e instructions,cycles,task-clock,cpu-clock,cpu-migrations,context-switches,cache-misses,duration_time -p $(pgrep --ns 4037791 | paste -s -d ",")

It works fine with a single threaded nodejs application. But in case of a multi-threaded service, as soon as it receives a request, the pstree output would be look like this:

root@node2:/home/m# pstree -p 4037791
node(4037791)─┬─sh(4037824)───node(4037825)─┬─{node}(4037826)
              │                             ├─{node}(4037827)
              │                             ├─{node}(4037828)
              │                             ├─{node}(4037829)
              │                             ├─{node}(4037830)
              │                             ├─{node}(4037831)
              │                             ├─{node}(1047898)
              │                             ├─{node}(1047899)
              │                             ├─{node}(1047900)
              │                             ├─{node}(1047901)
              │                             ├─{node}(1047902)
              │                             ├─{node}(1047903)
              │                             ├─{node}(1047904)
              │                             ├─{node}(1047905)
              │                             ├─{node}(1047906)
              │                             ├─{node}(1047907)
              │                             ├─{node}(1047908)
              │                             ├─{node}(1047909)
              │                             ├─{node}(1047910)
              │                             ├─{node}(1047911)
              │                             ├─{node}(1047913)
              │                             ├─{node}(1047914)
              │                             ├─{node}(1047919)
              │                             ├─{node}(1047920)
              │                             ├─{node}(1047921)
              │                             └─{node}(1047922)
              ├─{node}(4037805)
              ├─{node}(4037806)
              ├─{node}(4037807)
              ├─{node}(4037808)
              ├─{node}(4037809)
              ├─{node}(4037810)
              ├─{node}(4037811)
              ├─{node}(4037812)
              ├─{node}(4037813)
              └─{node}(4037814)

Therefore, my previous perf stat command would not capture the stats of the newly generated threads. I mean, it may capture accumulated instructions but it's definitely not showing in a "per-thread" format.

Is there any way that I can use --per-thread in perf stat and capture stats of the newly spawned threads in a multithreaded application? It seems to only work with -p or -t to follow a fixed set of threads that already exist when perf starts, and won't follow new ones.

There's a similar question here for perf record but I'm using perf stat. Also, that doesn't seem to separate the recorded profile by thread, so it's just equivalent to perf stat node ... Unless there's a way to process the recorded data to separate it out by thread after the fact?

`perf` isn't a requirement if there's something else that works:

Any other potential solutions that help me dynamically count "instructions,cycles,task-clock,cpu-clock,cpu-migrations,context-switches,cache-misses" per threads of a given PID (including newly spawned threads) is acceptable, whether using perf or anything else!

Zulan · Accepted Answer

The combination of perf record -s and perf report -T should give you the information you need.

To demonstrate, take the following example code using threads with well-defined instruction counts:

#include 
#include 

void work(int64_t count) {
    for (int64_t i = 0; i < count; i++);
}

int main() {
    std::thread first(work, 100000000ll);
    std::thread second(work, 400000000ll);
    std::thread third(work, 800000000ll);
    first.join();
    second.join();
    third.join();
}

(Compile without optimization!)

Now, use perf record as a prefix command. It will follow all spawned processes and threads.

$ perf record -s -e instructions -c 1000000000 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data (5 samples) ]

To display the statistics nicely:

$ perf report -T
[... snip ...]
#    PID     TID  instructions:u
  270682  270683       500003888
  270682  270684      2000001866
  270682  270685      4000002177

The parameters for perf record are a little bit tricky. -s writes separate records with fairly precise numbers - they do not depend on the instruction samples (generated every 1000000000 instructions). However, perf report, even with -T fails when it does not find a single sample. So you need to set a instruction sample count -c (or frequency) that triggers at least once. Any sample will do, it does not need a sample per thread.

Alternatively, you could look at the raw records from perf.data. Then you can actually tell perf record to not collect any samples.

$ perf record -s -e instructions -n ./a.out             
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data ]

But you need to filter out the relevant records and there might be additional records you need to sum up.

$ perf script -D | grep PERF_RECORD_READ | grep -v " 0$"
# Annotation by me                              PID    TID 
213962455637481 0x760 [0x40]: PERF_RECORD_READ: 270887 270888 instructions:u 500003881
213963194850657 0x890 [0x40]: PERF_RECORD_READ: 270887 270889 instructions:u 2000001874
213964190418415 0x9c0 [0x40]: PERF_RECORD_READ: 270887 270890 instructions:u 4000002175

How to count number of executed instructions of a process id including all future child threads

`perf` isn't a requirement if there's something else that works:

Answers (1)

Related Questions

How to count number of executed instructions of a process id including all future child threads

perf isn't a requirement if there's something else that works:

Answers (1)

Related Questions

`perf` isn't a requirement if there's something else that works: