Reputation: 1569
I came across this GitHub repo, which links to this blog post about cache-aware programming. It compares the number of cache-references and cache-misses in perf
output depending on how the data in an array is spread out relative to the cache line size. I wanted to perform a similar experiment on my own. The code I came up with looks like this:
// depth_first.c
#include <stdint.h>
#include <stdlib.h>
#define ROWS 100000
#define COLS 64
int main() {
    uint8_t (*mat)[COLS] = malloc(sizeof(uint8_t[ROWS][COLS])); // matrix
    uint8_t sum = 0;
    for (int row_idx = 0; row_idx < ROWS; row_idx++) {
        for (int col_idx = 0; col_idx < COLS; col_idx++) {
            sum += mat[row_idx][col_idx];
        }
    }
    free(mat);
}
(I know I am not initializing the array after malloc, which is UB, but I am not interested in the actual values and I assume it will not impact the cache behaviour.)
I dynamically allocate a 2D array of uint8_t called mat. Its dimensions are 100000 rows by 64 columns. As far as I understand, this means the memory layout of the array is as follows:
[ Row 0, 64 bytes ][ Row 1, 64 bytes ][ Row 2, 64 bytes ]...[ Row 99999, 64 bytes ]
Each row occupies a contiguous area of memory. I also specifically chose 64 bytes because that is the size of a cache line on my CPU. This means that a row should fit exactly in one cache line (assuming the allocation happens to be 64-byte aligned, which plain malloc does not guarantee).
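To sanity-check that layout assumption I could run a small check like the following (my own sketch, separate from the experiment; the file name is made up). It verifies the row-major offsets and prints the base address alignment; for a block guaranteed to start on a cache-line boundary one could use aligned_alloc(64, ...) instead of malloc:
// layout_check.c - sketch: verify row-major offsets and cache-line alignment
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#define ROWS 100000
#define COLS 64
int main(void) {
    uint8_t (*mat)[COLS] = malloc(sizeof(uint8_t[ROWS][COLS]));
    if (!mat) return 1;
    // Element (r, c) lives at byte offset r*COLS + c from the start of the block
    assert((uint8_t *)&mat[1][0] == (uint8_t *)mat + 1 * COLS);
    assert((uint8_t *)&mat[42][7] == (uint8_t *)mat + 42 * COLS + 7);
    // A row maps onto exactly one cache line only if the block is 64-byte aligned
    printf("base address modulo 64: %u\n", (unsigned)((uintptr_t)mat % 64));
    free(mat);
    return 0;
}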
Now, my experiment is as follows: I iterate through the array "depth first", visiting every column of the first row before moving on to the second row and so on, as seen in the original code above. Then I modify the code to access the array "breadth first", iterating through the first element of every row, then the second element of every row, and so on:
// breadth_first.c for loop
for (int col_idx = 0; col_idx < COLS; col_idx++) {     // col_idx in outer loop
    for (int row_idx = 0; row_idx < ROWS; row_idx++) { // row_idx in inner loop
        sum += mat[row_idx][col_idx];
    }
}
I compile both versions with no optimizations:
gcc -O0 breadth_first.c -o breadth_first
gcc -O0 depth_first.c -o depth_first
And test them using perf:
perf stat -e cache-references,cache-misses ./breadth_first
perf stat -e cache-references,cache-misses ./depth_first
The output I receive looks as follows (the numbers and percentages vary only slightly between runs):
Performance counter stats for './breadth_first':
12 654 452 cache-references:u
106 456 cache-misses:u # 0,841 % of all cache refs
0,015068004 seconds time elapsed
0,015102000 seconds user
0,000000000 seconds sys
Performance counter stats for './depth_first':
213 178 cache-references:u
5 901 cache-misses:u # 2,768 % of all cache refs
0,026617312 seconds time elapsed
0,026690000 seconds user
0,000000000 seconds sys
What I expected to see was a similar number of cache-references in both, with a larger number/percentage of cache-misses in the breadth_first
case. I expected the reference counts to be similar because both versions do the same "thing" and perform the same number of accesses into the 2D array.
Instead, the number of cache-references grew immensely. While the cache-misses also grew, the percentage was actually lower in the breadth_first case
(probably just because of the much larger number of references, so the percentage alone is not indicative of anything).
Now my question is, what causes the increase in cache-references?
I am testing this on an AMD CPU, in case this matters.
I located the mapping in the kernel source from the perf event id to what I assume is a hardware event id. I am now trying to navigate section 2.1.14 of the AMD PPR to find the description of the underlying hardware event. I would still appreciate an informed answer to the question.
Upvotes: 0
Views: 198
Reputation: 1569
Self-answering my question after I had a bit more time to research the topic. Thanks to @Ladislus and @Peter Cordes for commenting; their hints led me in the right direction. The answer will be a bit lengthy, as the question asked for an explanation of a phenomenon rather than a solution to a problem, and I want to give a good explanation of what was happening.
To begin with, the CPU I am currently testing this on is the AMD 4650U mobile APU. An excerpt from /proc/cpuinfo clarifies some important specifics:
vendor_id : AuthenticAMD
cpu family : 23
model : 96
model name : AMD Ryzen 5 PRO 4650U with Radeon Graphics
For the record, this may be a different CPU from the one I was using when asking the question, but that is not important. The CPU family is 23, which is 17h in hexadecimal, and the model is 96, which is 60h. This matters when deciding where to look for information.
Since I am on AMD, I began by looking in the AMD PPR. At the time of writing of this answer there are multiple versions of the PPR, targeted at different families and model ranges within those families. I obviously consulted the PPR meant for family 17h and model 60h. Most of the information should be there, mostly in sections 2.1.14 "MSR registers" and 2.1.15 "Performance Monitor Counters". It feels like the most "official" source for this kind of info, as hinted by man perf-list
in the "RAW HARDWARE EVENT DESCRIPTOR" section:
For instance on x86 CPUs, N is a hexadecimal value that represents the
raw register encoding with the layout of IA32_PERFEVTSELx MSRs (see
[Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume
3B: System Programming Guide] Figure 30-1 Layout of IA32_PERFEVTSELx
MSRs) or AMD’s PERF_CTL MSRs (see the Core Complex (CCX) → Processor
x86 Core → MSR Registers section of the [AMD Processor Programming
Reference (PPR)] relevant to the family, model and stepping of the
processor being used).
Sadly, I found the PPR difficult to navigate and could not really extract any useful info from it.
In the end I got the most use out of the Linux kernel sources. In particular, the /arch/x86/events/amd/core.c
file (mirror) contains mappings from perf events to Event ids and UMasks in hex, while the files in /tools/perf/pmu-events/arch/x86/amdzen2/
(especially core.json, cache.json and memory.json) list all the events. I chose the amdzen2 directory because my CPU's microarchitecture is Zen 2, but the right directory can also be found by inspecting mapfile.csv, which matches a specific directory based on (what looks like) info from /proc/cpuinfo
and a regex.
As for the events, they are encoded as 4-digit hexadecimal numbers. From what I have gathered, the first two digits are the UMask, while the second pair is the Event id. In core.c I found the mappings from Linux event names to hardware event ids (for family 17h):
static const u64 amd_f17h_perfmon_event_map[PERF_COUNT_HW_MAX] =
{
[PERF_COUNT_HW_CPU_CYCLES] = 0x0076,
[PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
[PERF_COUNT_HW_CACHE_REFERENCES] = 0xff60,
[PERF_COUNT_HW_CACHE_MISSES] = 0x0964,
[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c2,
[PERF_COUNT_HW_BRANCH_MISSES] = 0x00c3,
[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x0287,
[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x0187,
};
It seems that cache-references maps to the id 0xff60 and cache-misses maps to 0x0964. This can be verified by running perf stat
with both the human-friendly name of the event and the raw hardware id; the counts are identical:
perf stat -e cache-references,rff60,cache-misses,r0964 ./breadth
Performance counter stats for './breadth':
1 260 682 257 cache-references:u
1 260 682 257 rff60:u
10 728 501 cache-misses:u # 0,851 % of all cache refs
10 728 501 r0964:u
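As a side note, the rNNNN value is just the UMask in the upper byte and the Event id in the lower byte, following the PERF_CTL/PERFEVTSEL layout that man perf-list refers to. A tiny sketch of my own (the macro name is made up) shows how the two ids above decompose:
// raw_event.c - sketch: composing the 4-digit raw perf event ids
#include <stdio.h>
// UMask goes into bits 15:8, Event id into bits 7:0
#define RAW_EVENT(umask, event) ((unsigned)(((umask) << 8) | (event)))
int main(void) {
    printf("cache-references: r%04x\n", RAW_EVENT(0xff, 0x60)); // rff60
    printf("cache-misses:     r%04x\n", RAW_EVENT(0x09, 0x64)); // r0964
    return 0;
}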
This additionally explains why the count of references increased between the depth-first and breadth-first versions. As guessed in the comments, the cache-references event includes both reads and writes to the cache, as well as some other metrics. To be specific, cache.json begins with a list of events with Event id 0x60, each with a single-bit UMask:
[
{
"EventName": "l2_request_g1.rd_blk_l",
"EventCode": "0x60",
"BriefDescription": "All L2 Cache Requests (Breakdown 1 - Common). Data cache reads (including hardware and software prefetch).",
"UMask": "0x80"
},
{
"EventName": "l2_request_g1.rd_blk_x",
"EventCode": "0x60",
"BriefDescription": "All L2 Cache Requests (Breakdown 1 - Common). Data cache stores.",
"UMask": "0x40"
}, // Many more events here...
]
There is no single event with Event id 0x60 that has a UMask of 0xff, which makes me conclude that the 0xff mask is a bitwise OR of multiple single-bit masks. Overall, this means that the cache-references event is a sum of all the events with Event id 0x60. Some of those include stores, shared accesses and even instruction cache metrics. This answers why the count of references changed between the two programs.
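To make the "sum of sub-events" reading concrete: assuming (as the full cache.json listing suggests) that all eight single-bit UMasks of Event id 0x60 are defined, the composite UMasks are just their bitwise OR. A quick arithmetic sketch of my own:
// umask_or.c - sketch: composite UMasks are ORs of the single-bit ones
#include <stdio.h>
int main(void) {
    // all eight single-bit UMasks of Event id 0x60 (the excerpt above shows the first two)
    unsigned all = 0x80 | 0x40 | 0x20 | 0x10 | 0x08 | 0x04 | 0x02 | 0x01;
    // cache-references (rff60) therefore counts every sub-event of Event id 0x60
    printf("0x%02x\n", all);                // prints 0xff
    // the narrower rc860 used below keeps only three of them (UMasks 0x80, 0x40 and 0x08)
    printf("0x%02x\n", 0x80 | 0x40 | 0x08); // prints 0xc8
    return 0;
}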
Afterwards I decided to look for metrics better suited to the experiment. Thanks to the comments I settled on L1-dcache-loads and L1-dcache-load-misses. These produced results much closer to what I was expecting:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,r0040,rc860 ./depth
Performance counter stats for './depth':
1 280 108 192 L1-dcache-loads:u
14 834 L1-dcache-load-misses:u # 0,00% of all L1-dcache accesses
1 280 108 192 r0040:u
14 834 rc860:u
0,340287009 seconds time elapsed
0,340152000 seconds user
0,000000000 seconds sys
perf stat -e L1-dcache-loads,L1-dcache-load-misses,r0040,rc860 ./breadth
Performance counter stats for './breadth':
1 281 993 990 L1-dcache-loads:u
1 635 879 L1-dcache-load-misses:u # 0,13% of all L1-dcache accesses
1 281 993 990 r0040:u
1 635 879 rc860:u
0,464484543 seconds time elapsed
0,464455000 seconds user
0,000000000 seconds sys
As for where I found the hardware ids of these events, 0x0040 and 0xc860 come from core.c, line 131:
[C(L1D)] = {
    [C(OP_READ)] = {
        [C(RESULT_ACCESS)] = 0x0040, /* Data Cache Accesses */
        [C(RESULT_MISS)]   = 0xc860, /* L2$ access from DC Miss */
    },
    // Many more elements
},
0xc860 includes only three of the events with Event id 0x60, namely those with UMasks 0x80, 0x40 and 0x08 (0x80 | 0x40 | 0x08 = 0xc8). It is basically a smaller subset of the 0xff60 event. The 0x0040 event is also named "ls_dc_accesses" in the kernel and can be found in memory.json on line 87.
While I am happy with the results for now, I am aware that these two events are not the most accurate. 0xc860 still includes some writes, and 0x0040 is described as "speculative" (I am not yet sure what that means in this context). I will probably keep playing around with various metrics to see which produces the most accurate representation of what I want to know about the cache. Still, the question was mostly about the spike in cache-references, and I decided that I have enough information and sources to answer it.
Upvotes: 2