user3527764

Reputation: 115

perf_event_open: Including the execution of child process in case of sampling

According to the man page, the inherit bit can be set so that events in child processes are counted as well. However, inherit cannot be used together with PERF_FORMAT_GROUP. So how can I include the execution of child processes (for example, shell commands run from within C source code) when sampling with PERF_FORMAT_GROUP through perf_event_open?

Also, if PERF_FORMAT_GROUP is not specified, does each recorded sample include one struct read_format per event, or is a separate sample recorded for each event on its own?

Upvotes: 4

Views: 768

Answers (2)

Box Box Box Box

Reputation: 5241

As the question says, the man page states that inherit does not work with PERF_FORMAT_GROUP:

Inherit does not work for some combinations of read_format values, such as PERF_FORMAT_GROUP.

Interestingly, a comment in Google Benchmark's source code claims that the two do work together, contradicting the man page:

// We then proceed to populate the remaining fields in our attribute struct
// Note: the man page for perf_event_create suggests inherit = true and
// read_format = PERF_FORMAT_GROUP don't work together, but that's not the
// case.
attr.disabled = is_first;
attr.inherit = true;

So the code does use the inherit option with PERF_FORMAT_GROUP.

As a sanity check, I wrote a small program to see whether it works. Rather than work out the exact perf_event_open usage myself, I used Google Benchmark; to test the inherit = false case, I set it to false in the code above and recompiled Google Benchmark.

#include <benchmark/benchmark.h>
#include <bits/stdc++.h>
#include <pthread.h>

using namespace std;

extern "C" void spin(int);

// Stop flag shared between the two threads; atomic to avoid a data race.
std::atomic<int> ter{0};

void* helper_thread(void *args) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(1, &cpuset);
        assert(pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) == 0);

        while (ter == 0) {
                spin(26);
        }
        return NULL;
}

void BM(benchmark::State &state) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(2, &cpuset);
        assert(pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) == 0);

        ter = 0;
        pthread_t thr;
        assert(pthread_create(&thr, NULL, helper_thread, NULL) == 0);
        spin(5000);

        for (auto _ : state) {
                spin(state.range(0));
        }

        ter = -1;
        pthread_join(thr, NULL);
}

BENCHMARK(BM)->ArgsProduct({
                benchmark::CreateDenseRange(884, 988, 26)
                });

BENCHMARK_MAIN();

The program runs two threads that both just keep spinning. void spin(int) is a small function that spins for roughly the number of cycles passed as its argument; I tuned it for my system (11th Gen Intel(R) Core(TM) i5-11400H). The two threads are pinned to different cores.

First, I ran Google Benchmark with inherit set to false. On my system, CPUs 1 and 2 belong to different physical cores, and I have also isolated CPU 2. With inherit set to false, the PMU counters should only measure the cycles of the main thread, which is what I observed:

-----------------------------------------------------------------------------
Benchmark           Time             CPU   Iterations     CYCLES INSTRUCTIONS
-----------------------------------------------------------------------------
BM/884            202 ns          202 ns      3463599    884.004          180
BM/910            208 ns          208 ns      3368427     910.06          185
BM/936            214 ns          214 ns      3262906    936.159          190
BM/962            220 ns          220 ns      3188990    962.007          195
BM/988            229 ns          229 ns      3108513    988.004          200

BM/n means that the main thread spins for n cycles per iteration while the helper thread spins continuously in the background. The measured cycles match n almost exactly, so the helper thread is not included.

As another small check before setting inherit to true, I ran the helper thread on CPU 8, which is a hyper-thread sibling of CPU 2 (same physical core). The results are as expected: the main thread needs more cycles because it shares the core, while its instruction count is unchanged:

-----------------------------------------------------------------------------
Benchmark           Time             CPU   Iterations     CYCLES INSTRUCTIONS
-----------------------------------------------------------------------------
BM/884            257 ns          257 ns      2725769   1.14975k          180
BM/910            266 ns          266 ns      2640910   1.18335k          185
BM/936            273 ns          273 ns      2566115   1.21762k          190
BM/962            286 ns          286 ns      2411894   1.25108k          195
BM/988            295 ns          295 ns      2433577   1.28563k          200

Finally, I set inherit back to true, as in the original Google Benchmark source. The helper thread runs on CPU 1 and the main thread on CPU 2. As expected, the counts are clearly higher than those of the main thread alone, showing that the helper thread's cycles and instructions are counted as well:

-----------------------------------------------------------------------------
Benchmark           Time             CPU   Iterations     CYCLES INSTRUCTIONS
-----------------------------------------------------------------------------
BM/884            202 ns          202 ns      3455988    1.7598k      617.728
BM/910            209 ns          209 ns      3316702   1.81128k      635.365
BM/936            216 ns          216 ns      3203655   1.86363k      653.338
BM/962            219 ns          219 ns      3182793   1.91633k      671.598
BM/988            226 ns          226 ns      3099125   1.96741k       689.28

spin is defined as follows:

.global spin
.align 128
spin:
        jmp j1
j1:     jmp loop
.align 128
loop:
        sub $26,%rdi
        lfence
        jle end
        lfence
        jmp loop
end:
        ret

I'm not sure why the man page says what it does, but judging from the Google Benchmark comment and this small sanity check, it seems fine to use inherit together with PERF_FORMAT_GROUP.
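For anyone who wants to try the combination without Google Benchmark, here is a minimal sketch in C of how a counting group with inherit = 1 and PERF_FORMAT_GROUP might be opened and read directly. It is not what produced the numbers above. The helper names (perf_open, make_attr, open_inherited_group, read_group) are made up for illustration, the choice of cycles and instructions simply mirrors the tables, and whether the kernel accepts this combination may depend on the kernel version and on perf_event_paranoid, so the open call can fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so go through syscall(2). */
int perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
              int group_fd, unsigned long flags)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: a counting (non-sampling) hardware event that is
 * inherited by tasks created later and read as part of a group. */
struct perf_event_attr make_attr(uint64_t config, int is_leader)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = is_leader ? 1 : 0;  /* only the leader starts disabled */
    attr.inherit = 1;                   /* follow children and new threads */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    attr.read_format = PERF_FORMAT_GROUP;
    return attr;
}

/* Open cycles (leader) and instructions (member) for the calling thread on
 * any CPU. Returns the leader fd and stores the member fd in *member_fd,
 * or returns -errno and leaves *member_fd untouched. */
int open_inherited_group(int *member_fd)
{
    assert(member_fd != NULL);
    struct perf_event_attr leader = make_attr(PERF_COUNT_HW_CPU_CYCLES, 1);
    struct perf_event_attr member = make_attr(PERF_COUNT_HW_INSTRUCTIONS, 0);

    int lfd = perf_open(&leader, 0, -1, -1, 0);
    if (lfd < 0)
        return -errno;
    int mfd = perf_open(&member, 0, -1, lfd, 0);
    if (mfd < 0) {
        int saved = errno;
        close(lfd);
        return -saved;
    }
    *member_fd = mfd;
    return lfd;
}

/* With read_format == PERF_FORMAT_GROUP, read() on the leader returns
 * { u64 nr; u64 values[nr]; }. Here nr is expected to be 2. */
int read_group(int leader_fd, uint64_t out[2])
{
    uint64_t buf[3];
    ssize_t n = read(leader_fd, buf, sizeof(buf));
    if (n != (ssize_t)sizeof(buf) || buf[0] != 2)
        return -1;
    out[0] = buf[1];  /* cycles */
    out[1] = buf[2];  /* instructions */
    return 0;
}
```

To use it, call open_inherited_group before creating any children, enable the group with ioctl(leader_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP), run the workload (fork, system, pthread_create, ...), disable the group the same way with PERF_EVENT_IOC_DISABLE, and then call read_group. Tasks that already existed when the events were opened are not picked up by inherit.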

Upvotes: 1

Zulan

Reputation: 22650

If you need PERF_FORMAT_GROUP and it does not work with the built-in inherit, you have to keep track of the children yourself. You can do that with ptrace and then set up perf_event_open for every child task. You then also have to merge the samples from all event file descriptors.

Edit: Without PERF_FORMAT_GROUP, the events are sampled independently, so their values are not recorded at the same time. You could of course just set up counting events (instead of sampling events) and read them together at regular intervals from userspace.
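The userspace polling idea could be sketched roughly as follows in C. This assumes a group leader file descriptor that was opened elsewhere with read_format set to exactly PERF_FORMAT_GROUP (no ID or time fields); the names group_snapshot, snapshot_group, group_delta and the MAX_EVENTS limit are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_EVENTS 8  /* arbitrary upper bound chosen for this sketch */

/* What read() on a group leader returns when read_format is exactly
 * PERF_FORMAT_GROUP: the event count followed by one value per event. */
struct group_snapshot {
    uint64_t nr;                 /* number of events in the group */
    uint64_t value[MAX_EVENTS];  /* counter values, leader first */
};

/* Read every counter of the group with a single read() call, so all values
 * belong to the same point in time. Returns 0 on success, -1 on error. */
int snapshot_group(int leader_fd, struct group_snapshot *snap)
{
    memset(snap, 0, sizeof(*snap));
    ssize_t n = read(leader_fd, snap, sizeof(*snap));
    if (n < (ssize_t)sizeof(uint64_t))
        return -1;
    if (snap->nr > MAX_EVENTS)
        return -1;
    if ((size_t)n != (1 + snap->nr) * sizeof(uint64_t))
        return -1;
    return 0;
}

/* Per-interval increments: out[i] = cur->value[i] - prev->value[i].
 * Returns the number of events, or -1 if the snapshots do not match. */
int group_delta(const struct group_snapshot *prev,
                const struct group_snapshot *cur, uint64_t *out)
{
    assert(out != NULL);
    if (prev->nr != cur->nr || cur->nr > MAX_EVENTS)
        return -1;
    for (uint64_t i = 0; i < cur->nr; i++)
        out[i] = cur->value[i] - prev->value[i];
    return (int)cur->nr;
}
```

A polling loop would call snapshot_group every interval (for example after usleep or from a timer) and pass the previous and current snapshots to group_delta. The resolution is limited by how often userspace wakes up, which is the trade-off against in-kernel sampling.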

Upvotes: 1
