Reputation: 2625
Some times ago, I asked the following question "How to count number of executed instructions of a process id including child processes", and @M-Iduoad kindly provided a solution with pgrep
to capture all child PIDs and use it with -p in perf stat. It works great!
However, one problem I encountered is with multi-threaded application, and when a new thread is being spawned. Since I'm not a fortune teller (too bad!), I don't know tid
of the newly generated threads, and therefore I can't add them in the perf stat
's -p or -t parameter.
As an example, let's assume I have a multithreaded nodejs server (deployed as a container on top of Kubernetes) with the following pstree
:
root@node2:/home/m# pstree -p 4037791
node(4037791)─┬─sh(4037824)───node(4037825)─┬─{node}(4037826)
│ ├─{node}(4037827)
│ ├─{node}(4037828)
│ ├─{node}(4037829)
│ ├─{node}(4037830)
│ └─{node}(4037831)
├─{node}(4037805)
├─{node}(4037806)
├─{node}(4037807)
├─{node}(4037808)
├─{node}(4037809)
├─{node}(4037810)
├─{node}(4037811)
├─{node}(4037812)
├─{node}(4037813)
└─{node}(4037814)
Of course, I can have the following perf stat
command to watch its threads:
perf stat --per-thread -e instructions,cycles,task-clock,cpu-clock,cpu-migrations,context-switches,cache-misses,duration_time -p $(pgrep --ns 4037791 | paste -s -d ",")
It works fine with a single threaded nodejs application. But in case of a multi-threaded service, as soon as it receives a request, the pstree
output would be look like this:
root@node2:/home/m# pstree -p 4037791
node(4037791)─┬─sh(4037824)───node(4037825)─┬─{node}(4037826)
│ ├─{node}(4037827)
│ ├─{node}(4037828)
│ ├─{node}(4037829)
│ ├─{node}(4037830)
│ ├─{node}(4037831)
│ ├─{node}(1047898)
│ ├─{node}(1047899)
│ ├─{node}(1047900)
│ ├─{node}(1047901)
│ ├─{node}(1047902)
│ ├─{node}(1047903)
│ ├─{node}(1047904)
│ ├─{node}(1047905)
│ ├─{node}(1047906)
│ ├─{node}(1047907)
│ ├─{node}(1047908)
│ ├─{node}(1047909)
│ ├─{node}(1047910)
│ ├─{node}(1047911)
│ ├─{node}(1047913)
│ ├─{node}(1047914)
│ ├─{node}(1047919)
│ ├─{node}(1047920)
│ ├─{node}(1047921)
│ └─{node}(1047922)
├─{node}(4037805)
├─{node}(4037806)
├─{node}(4037807)
├─{node}(4037808)
├─{node}(4037809)
├─{node}(4037810)
├─{node}(4037811)
├─{node}(4037812)
├─{node}(4037813)
└─{node}(4037814)
Therefore, my previous perf stat
command would not capture the stats of the newly generated threads. I mean, it may capture accumulated instructions but it's definitely not showing in a "per-thread" format.
Is there any way that I can use --per-thread
in perf stat and capture stats of the newly spawned threads in a multithreaded application? It seems to only work with -p
or -t
to follow a fixed set of threads that already exist when perf
starts, and won't follow new ones.
There's a similar question here for perf record
but I'm using perf stat
. Also, that doesn't seem to separate the recorded profile by thread, so it's just equivalent to perf stat node ...
Unless there's a way to process the recorded data to separate it out by thread after the fact?
perf
isn't a requirement if there's something else that works:Any other potential solutions that help me dynamically count "instructions,cycles,task-clock,cpu-clock,cpu-migrations,context-switches,cache-misses" per threads of a given PID (including newly spawned threads) is acceptable, whether using perf
or anything else!
Upvotes: 3
Views: 1126
Reputation: 22670
The combination of perf record -s
and perf report -T
should give you the information you need.
To demonstrate, take the following example code using threads with well-defined instruction counts:
#include <cstdint>
#include <thread>
void work(int64_t count) {
for (int64_t i = 0; i < count; i++);
}
int main() {
std::thread first(work, 100000000ll);
std::thread second(work, 400000000ll);
std::thread third(work, 800000000ll);
first.join();
second.join();
third.join();
}
(Compile without optimization!)
Now, use perf record
as a prefix command. It will follow all spawned processes and threads.
$ perf record -s -e instructions -c 1000000000 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data (5 samples) ]
To display the statistics nicely:
$ perf report -T
[... snip ...]
# PID TID instructions:u
270682 270683 500003888
270682 270684 2000001866
270682 270685 4000002177
The parameters for perf record
are a little bit tricky. -s
writes separate records with fairly precise numbers - they do not depend on the instruction samples (generated every 1000000000 instructions). However, perf report
, even with -T
fails when it does not find a single sample. So you need to set a instruction sample count -c
(or frequency) that triggers at least once. Any sample will do, it does not need a sample per thread.
Alternatively, you could look at the raw records from perf.data
. Then you can actually tell perf record
to not collect any samples.
$ perf record -s -e instructions -n ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data ]
But you need to filter out the relevant records and there might be additional records you need to sum up.
$ perf script -D | grep PERF_RECORD_READ | grep -v " 0$"
# Annotation by me PID TID
213962455637481 0x760 [0x40]: PERF_RECORD_READ: 270887 270888 instructions:u 500003881
213963194850657 0x890 [0x40]: PERF_RECORD_READ: 270887 270889 instructions:u 2000001874
213964190418415 0x9c0 [0x40]: PERF_RECORD_READ: 270887 270890 instructions:u 4000002175
Upvotes: 5