Reputation: 115
According to the man page, the inherit bit can be set so that the execution of child processes is included when counting events. However, inherit cannot be used together with PERF_FORMAT_GROUP. So how can I include the execution of child processes (for example, shell commands executed from within C source code) so that they are counted when sampling with PERF_FORMAT_GROUP using perf_event_open?
If PERF_FORMAT_GROUP is not specified, does each sample record include as many struct read_format entries as there are events, or is a separate sample recorded for each event on its own?
Upvotes: 4
Views: 768
Reputation: 5241
As the question notes, the man page says that inherit cannot work with PERF_FORMAT_GROUP:
Inherit does not work for some combinations of read_format values, such as PERF_FORMAT_GROUP.
Interestingly, there is a comment in google benchmark's code which states that they do indeed work fine together, in contradiction with the man page.
// We then proceed to populate the remaining fields in our attribute struct
// Note: the man page for perf_event_create suggests inherit = true and
// read_format = PERF_FORMAT_GROUP don't work together, but that's not the
// case.
attr.disabled = is_first;
attr.inherit = true;
So the code does use the inherit option together with PERF_FORMAT_GROUP.
As a sanity check I've written a small program to test whether it works. I'm too lazy to figure out the exact way of using perf_event_open, so I simply used google benchmark; to test the case where inherit is false, I set it to false in the code above and recompiled google benchmark.
#include <benchmark/benchmark.h>
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cassert>

// Defined in assembly (shown further down): spins for the given number of cycles.
extern "C" void spin(int);

// Stop flag for the helper thread; atomic because both threads access it.
std::atomic<int> ter{0};

// Pin the calling thread to one CPU. The call is kept outside assert() so
// that it still runs when NDEBUG is defined.
static void pin_to_cpu(int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    assert(rc == 0);
    (void)rc;
}

void* helper_thread(void*) {
    pin_to_cpu(1);
    // Keep spinning in the background until the benchmark is done.
    while (ter.load() == 0) {
        spin(26);
    }
    return nullptr;
}

void BM(benchmark::State& state) {
    pin_to_cpu(2);
    ter = 0;
    pthread_t thr;
    int rc = pthread_create(&thr, nullptr, helper_thread, nullptr);
    assert(rc == 0);
    (void)rc;
    spin(5000);
    for (auto _ : state) {
        spin(state.range(0));
    }
    ter = -1;  // tell the helper thread to stop
    pthread_join(thr, nullptr);
}

BENCHMARK(BM)->ArgsProduct({
    benchmark::CreateDenseRange(884, 988, 26)
});
BENCHMARK_MAIN();
In the program I run two threads which both just keep spinning. void spin(int) is just a small function which spins for the number of cycles passed as an argument, which I have tuned for my system (11th Gen Intel(R) Core(TM) i5-11400H). The two threads are pinned to different cores.
First, I'm using google benchmark with inherit set to false. On my system, cores 1 & 2 correspond to different physical cores, and I've isolated core 2 as well. As inherit is false, the PMU counters should only measure the cycles of the main thread, which is what is observed:
-----------------------------------------------------------------------------
Benchmark        Time          CPU   Iterations      CYCLES   INSTRUCTIONS
-----------------------------------------------------------------------------
BM/884         202 ns       202 ns      3463599     884.004            180
BM/910         208 ns       208 ns      3368427      910.06            185
BM/936         214 ns       214 ns      3262906     936.159            190
BM/962         220 ns       220 ns      3188990     962.007            195
BM/988         229 ns       229 ns      3108513     988.004            200
BM/n means that the main thread spins for n cycles while the helper thread keeps spinning continuously in the background. The measured cycles match n almost exactly, hence the helper thread is not included.
As another small check before setting inherit to true, I run the helper thread on core 8 (which corresponds to the same physical core as core 2, i.e. they are hyper-threads). The results are as expected: the main thread now shares its physical core and needs more cycles per iteration:
-----------------------------------------------------------------------------
Benchmark        Time          CPU   Iterations      CYCLES   INSTRUCTIONS
-----------------------------------------------------------------------------
BM/884         257 ns       257 ns      2725769    1.14975k            180
BM/910         266 ns       266 ns      2640910    1.18335k            185
BM/936         273 ns       273 ns      2566115    1.21762k            190
BM/962         286 ns       286 ns      2411894    1.25108k            195
BM/988         295 ns       295 ns      2433577    1.28563k            200
Finally, I reset inherit back to true, as in the original google benchmark source. The helper thread runs on core 1 and the main thread on core 2. As expected, the counts are clearly higher than the main thread's spins alone (the cycle counts are roughly double), showing that the cycles of the helper thread are also taken into account:
-----------------------------------------------------------------------------
Benchmark        Time          CPU   Iterations      CYCLES   INSTRUCTIONS
-----------------------------------------------------------------------------
BM/884         202 ns       202 ns      3455988     1.7598k        617.728
BM/910         209 ns       209 ns      3316702    1.81128k        635.365
BM/936         216 ns       216 ns      3203655    1.86363k        653.338
BM/962         219 ns       219 ns      3182793    1.91633k        671.598
BM/988         226 ns       226 ns      3099125    1.96741k         689.28
spin is defined as follows:
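    .global spin
    .align 128
spin:
    jmp j1
j1: jmp loop
    .align 128
loop:
    sub $26,%rdi    # take 26 off the remaining count
    lfence          # wait for earlier instructions to complete
    jle end         # flags still come from the sub: stop once the count is <= 0
    lfence
    jmp loop
end:
    ret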
I'm not really sure why the man page says what it does, but going by the google benchmark comment plus this small sanity check, it seems to me that it should be fine to use inherit together with PERF_FORMAT_GROUP.
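For anyone calling perf_event_open directly, the combination might look roughly like the sketch below. It is an illustration, not google benchmark's code: the helper names (make_group_attr, open_event, open_group, read_group) and the MAX_GROUP_EVENTS cap are made up here. The read layout assumes read_format is exactly PERF_FORMAT_GROUP, in which case the man page describes the result as a count followed by one value per event. Whether the kernel accepts inherit with a group read may depend on the kernel version, so the return values are worth checking.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_GROUP_EVENTS 8 /* illustrative cap, not a kernel limit */

/* Attribute for one hardware counter in an inherited group.
 * Only the leader starts disabled; members wait for the leader anyway. */
static struct perf_event_attr make_group_attr(uint64_t config, int is_leader)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = is_leader ? 1 : 0;
    attr.inherit = 1;                     /* also count tasks created later */
    attr.read_format = PERF_FORMAT_GROUP; /* one read() returns the group */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return attr;
}

/* glibc provides no perf_event_open wrapper, so go through syscall(2). */
static int open_event(struct perf_event_attr *attr, pid_t pid, int group_fd)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, -1 /* any CPU */,
                        group_fd, 0UL);
}

/* Open cycles (leader) + instructions on the calling thread.
 * Returns the leader fd or -1; the member fd is stored in *member_fd. */
static int open_group(int *member_fd)
{
    struct perf_event_attr leader = make_group_attr(PERF_COUNT_HW_CPU_CYCLES, 1);
    struct perf_event_attr member = make_group_attr(PERF_COUNT_HW_INSTRUCTIONS, 0);
    int leader_fd = open_event(&leader, 0, -1);
    if (leader_fd < 0)
        return -1;
    *member_fd = open_event(&member, 0, leader_fd);
    if (*member_fd < 0) {
        close(leader_fd);
        return -1;
    }
    return leader_fd;
}

/* Read every counter of the group through the leader fd in one read().
 * Layout for PERF_FORMAT_GROUP alone: u64 nr, then nr u64 values.
 * Returns the number of values copied to out, or -1 on error. */
static int read_group(int leader_fd, uint64_t *out, size_t max_out)
{
    uint64_t buf[1 + MAX_GROUP_EVENTS];
    ssize_t n = read(leader_fd, buf, sizeof(buf));
    if (n < (ssize_t)sizeof(uint64_t))
        return -1;
    if (buf[0] > MAX_GROUP_EVENTS || buf[0] > max_out)
        return -1;
    if ((size_t)n < (1 + buf[0]) * sizeof(uint64_t))
        return -1;
    for (uint64_t i = 0; i < buf[0]; i++)
        out[i] = buf[1 + i];
    return (int)buf[0];
}
```

After open_group succeeds, ioctl(leader_fd, PERF_EVENT_IOC_ENABLE, 0) starts counting, and read_group(leader_fd, counts, 2) should then return the cycles value followed by the instructions value, in the order the events were added to the group.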
Upvotes: 1
Reputation: 22650
If you need to use PERF_FORMAT_GROUP, and that doesn't work with the built-in inherit, then you have to keep track of the children yourself. You can do that by using ptrace and then setting up perf_event_open for all child tasks. Then you also have to merge the samples from all event file descriptors.
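A rough sketch of that tracking loop follows, under several assumptions: the caller has already forked a child that called ptrace(PTRACE_TRACEME) and raise(SIGSTOP) before exec, the caller has no other children, and only a cycles counter is opened per task. The names is_new_task_stop, open_cycles_for_task and track_children are hypothetical, and the signal handling is deliberately simplified, so treat it as a starting point rather than a complete tracer.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <signal.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* True if a waitpid() status is a ptrace stop announcing a new task
 * (fork, vfork or clone) created by a tracee. */
static int is_new_task_stop(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    int event = status >> 16;
    return event == PTRACE_EVENT_FORK || event == PTRACE_EVENT_VFORK ||
           event == PTRACE_EVENT_CLONE;
}

/* Open a non-inherited cycles counter on one task (hypothetical helper). */
static int open_cycles_for_task(pid_t tid)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.read_format = PERF_FORMAT_GROUP;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0UL);
}

/* Follow `child` and every task it creates, opening one event per task.
 * Returns the number of fds stored in fds[], or -1 on setup failure. */
static int track_children(pid_t child, int *fds, int max_fds)
{
    int status, nfds = 0;
    long opts = PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK | PTRACE_O_TRACECLONE;

    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1; /* expected the initial SIGSTOP */
    if (ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)opts) < 0)
        return -1;
    if (max_fds > 0 && (fds[0] = open_cycles_for_task(child)) >= 0)
        nfds = 1;
    ptrace(PTRACE_CONT, child, NULL, NULL);

    for (;;) {
        pid_t tid = waitpid(-1, &status, __WALL);
        if (tid < 0)
            break; /* no tracees left */
        if (!WIFSTOPPED(status))
            continue; /* a task exited or was killed */
        long sig = WSTOPSIG(status);
        if (is_new_task_stop(status)) {
            unsigned long new_tid = 0;
            if (ptrace(PTRACE_GETEVENTMSG, tid, NULL, &new_tid) == 0 &&
                nfds < max_fds) {
                int fd = open_cycles_for_task((pid_t)new_tid);
                if (fd >= 0)
                    fds[nfds++] = fd;
            }
        }
        /* Simplification: swallow SIGTRAP (ptrace events, exec) and SIGSTOP
         * (the automatic stop of a new tracee); forward everything else. */
        if (sig == SIGTRAP || sig == SIGSTOP)
            sig = 0;
        ptrace(PTRACE_CONT, tid, NULL, (void *)sig);
    }
    return nfds;
}
```

Each returned fd covers exactly one task, so the per-task readings (or samples, if sampling events are opened instead) still have to be combined by the caller.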
Edit:
Without PERF_FORMAT_GROUP, the internal sample recording for the individual events does not happen at the same time. You could of course just set up counting events (instead of sampling events) and read them at the same time, at regular intervals, from userspace.
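The userspace polling could be sketched as follows, assuming each event was opened on its own with read_format left at 0, so that every read() yields a single 64-bit count. The names read_counters and poll_counters and the MAX_COUNTERS cap are made up for illustration. The reads happen back to back rather than atomically, so the values are only approximately simultaneous.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MAX_COUNTERS 16 /* illustrative cap */

/* Read n independently opened counting events one after another.
 * Returns 0 on success, -1 if any read comes back short or fails. */
static int read_counters(const int *fds, size_t n, uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (read(fds[i], &out[i], sizeof(out[i])) != (ssize_t)sizeof(out[i]))
            return -1;
    }
    return 0;
}

/* Take `rounds` snapshots, interval_ms apart, and print one line each. */
static int poll_counters(const int *fds, size_t n, unsigned rounds,
                         long interval_ms)
{
    uint64_t counts[MAX_COUNTERS];
    struct timespec pause = { interval_ms / 1000,
                              (interval_ms % 1000) * 1000000L };
    if (n > MAX_COUNTERS)
        return -1;
    for (unsigned r = 0; r < rounds; r++) {
        if (read_counters(fds, n, counts) != 0)
            return -1;
        for (size_t i = 0; i < n; i++)
            printf("%s%llu", i ? " " : "", (unsigned long long)counts[i]);
        putchar('\n');
        nanosleep(&pause, NULL);
    }
    return 0;
}
```

Subtracting consecutive snapshots gives the per-interval counts for each event.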
Upvotes: 1