Frontier_Setter

Reputation: 649

How to verify the granularity of memory access interleaving across different channels?

According to AMD's documentation, with NPS1 set, accesses to contiguous physical addresses are interleaved across all memory channels. On a machine with 8 memory channels and a 256-byte interleave size, a program that deliberately touches only the first 256 bytes of every 2 KiB block (8 channels x 256 bytes) should therefore concentrate all of its memory traffic on a single channel.
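
The expectation above can be written out as arithmetic. This is only a minimal sketch of a naive model, assuming the channel is simply (physical address / interleave size) mod channel count. The names `naive_channel` and `naive_hits` are hypothetical, and the model ignores any extra address hashing the hardware may apply.

```c
#include <assert.h>
#include <stdint.h>

#define INTERLEAVE_LEN 256u  /* bytes per interleave chunk (from the question) */
#define CHANNEL_NUM    8u    /* number of memory channels (from the question) */

/* Hypothetical naive model: consecutive 256-byte chunks of the physical
 * address space rotate through the channels in order. */
static unsigned naive_channel(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / INTERLEAVE_LEN) % CHANNEL_NUM);
}

/* Count how many of n accesses, starting at base and spaced stride bytes
 * apart, land on channel ch under the naive model. */
static uint64_t naive_hits(uint64_t base, uint64_t stride, uint64_t n, unsigned ch)
{
    assert(ch < CHANNEL_NUM);
    uint64_t hits = 0;
    for (uint64_t i = 0; i < n; ++i) {
        if (naive_channel(base + i * stride) == ch)
            ++hits;
    }
    return hits;
}
```

Under this model, `naive_hits(0, 2048, 1000, 0)` counts all 1000 accesses on channel 0, while a 256-byte stride spreads the accesses evenly over the 8 channels.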

However, when I test this, the measured memory bandwidth is still spread evenly across all memory channels:

Total Mem Bw (GB/s),26.34,0.23
Total Mem RdBw (GB/s),26.23,0.13
Total Mem WrBw (GB/s),0.11,0.10
Mem Ch-A RdBw (GB/s),3.27,0.02
Mem Ch-A WrBw (GB/s),0.01,0.01
Mem Ch-B RdBw (GB/s),3.27,0.02
Mem Ch-B WrBw (GB/s),0.01,0.01
Mem Ch-C RdBw (GB/s),3.29,0.02
Mem Ch-C WrBw (GB/s),0.01,0.01
Mem Ch-D RdBw (GB/s),3.29,0.02
Mem Ch-D WrBw (GB/s),0.02,0.01
Mem Ch-E RdBw (GB/s),3.27,0.02
Mem Ch-E WrBw (GB/s),0.01,0.01
Mem Ch-F RdBw (GB/s),3.27,0.02
Mem Ch-F WrBw (GB/s),0.01,0.01
Mem Ch-G RdBw (GB/s),3.28,0.02
Mem Ch-G WrBw (GB/s),0.02,0.01
Mem Ch-H RdBw (GB/s),3.28,0.02
Mem Ch-H WrBw (GB/s),0.01,0.01

Why is this?

Below is the core logic of my code.

#define INTERLEAVE_LEN 256
#define CHANNEL_NUM 8
#define STEP_LEN (128)

#define ADDR_BOUNDRY_INST (1UL << 7)    // each read is aligned to 128 B
const uint64_t addr_mask_inst = ~(ADDR_BOUNDRY_INST-1);

while(!should_stop){

    for(uint64_t j = 0; j < INTERLEAVE_LEN/STEP_LEN; ++j){
        memread_128B_32Bstep((char*)(c_test_area+pos1));
        pos1 = (((pos1 + STEP_LEN) % cur_memsize) & cur_mask);
    }

    pos1 = (((pos1 + INTERLEAVE_LEN*(CHANNEL_NUM-1)) % cur_memsize) & cur_mask);
}

static inline void memread_128B_32Bstep(char* memarea)
{
    asm volatile(
        "mov    %[memarea], %%rax \n"   // rax = base address of the 128-byte block
        "vmovdqa 0*32(%%rax), %%ymm0 \n"
        "vmovdqa 1*32(%%rax), %%ymm0 \n"
        "vmovdqa 2*32(%%rax), %%ymm0 \n"
        "vmovdqa 3*32(%%rax), %%ymm0 \n"
        : 
        : [memarea] "r" (memarea)
        : "rax", "xmm0", "memory");
}
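
To rule out a bug in the position arithmetic itself, the loop can be replayed without touching memory. The sketch below assumes that `cur_mask` is the 128-byte alignment mask and that `cur_memsize` is a multiple of 2 KiB (neither is shown in the snippet above). The helper name `pattern_stays_in_first_chunk` is hypothetical. The offsets are relative to the buffer base, so they only match physical offsets if the base is aligned to at least 2 KiB and the backing pages are at least that large.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INTERLEAVE_LEN 256
#define CHANNEL_NUM 8
#define STEP_LEN 128

/* Replays the position updates of the loop in the question and reports
 * whether every 128-byte read starts inside the first INTERLEAVE_LEN bytes
 * of an (INTERLEAVE_LEN * CHANNEL_NUM)-byte window.
 * Assumption: cur_mask is the 128-byte alignment mask. */
static bool pattern_stays_in_first_chunk(uint64_t memsize, uint64_t outer_iters)
{
    const uint64_t mask = ~(uint64_t)(STEP_LEN - 1);
    const uint64_t window = (uint64_t)INTERLEAVE_LEN * CHANNEL_NUM;
    assert(memsize != 0 && memsize % window == 0);

    uint64_t pos = 0;
    for (uint64_t i = 0; i < outer_iters; ++i) {
        /* Two 128-byte reads cover the first 256 bytes of the window. */
        for (uint64_t j = 0; j < INTERLEAVE_LEN / STEP_LEN; ++j) {
            if (pos % window >= INTERLEAVE_LEN)
                return false;
            pos = ((pos + STEP_LEN) % memsize) & mask;
        }
        /* Skip the remaining 7 chunks of the window. */
        pos = ((pos + (uint64_t)INTERLEAVE_LEN * (CHANNEL_NUM - 1)) % memsize) & mask;
    }
    return true;
}
```

If this returns true for the buffer size in use, the offsets follow the intended pattern, and any spreading across channels has to come from how those offsets map to physical addresses and channels.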

The buffer's virtual address is aligned to 4 KiB, and all of the allocated pages are 2 MiB huge pages.

#define ADDR_BOUNDRY (1UL << 12)    
const uint64_t addr_boundry = ADDR_BOUNDRY;
const uint64_t addr_mask = ~(ADDR_BOUNDRY-1);
char* tmp = (char*)malloc(mem_size + addr_boundry);

c_test_area = (char*)(((uint64_t)tmp + addr_boundry) & addr_mask);
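
Since the interleaving is defined on physical addresses, it may help to look up the physical address behind `c_test_area`. Below is a minimal sketch, assuming Linux and its `/proc/self/pagemap` interface. The helper name `virt_to_phys` is hypothetical, and on recent kernels the frame number is only visible to privileged processes, so the function returns 0 when the address cannot be resolved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Translate a virtual address of this process to a physical address by
 * reading /proc/self/pagemap (Linux only). Returns 0 if the file cannot be
 * read, the page is not present, or the kernel hides the frame number. */
static uint64_t virt_to_phys(const void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    FILE *f = fopen("/proc/self/pagemap", "rb");
    if (!f)
        return 0;
    setvbuf(f, NULL, _IONBF, 0);  /* read exactly one 8-byte entry */

    /* pagemap holds one 64-bit entry per virtual page. */
    uint64_t addr = (uint64_t)(uintptr_t)vaddr;
    uint64_t vpn = addr / (uint64_t)page_size;
    uint64_t entry = 0;
    int ok = fseek(f, (long)(vpn * sizeof entry), SEEK_SET) == 0 &&
             fread(&entry, sizeof entry, 1, f) == 1;
    fclose(f);
    if (!ok)
        return 0;

    if (((entry >> 63) & 1) == 0)  /* bit 63: page present in RAM */
        return 0;
    uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54: frame number */
    if (pfn == 0)
        return 0;
    return pfn * (uint64_t)page_size + addr % (uint64_t)page_size;
}
```

The page has to be touched before the lookup so that it is actually mapped. Calling `virt_to_phys(c_test_area)` as root then gives an address whose low bits can be compared against the 2 KiB pattern.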

I'm testing on an AMD EPYC 7713.


Update with more results:

I disabled all hardware prefetchers in the BIOS and verified this with the hwpfdc metric in AMD uProf, but the results did not change.

I varied the stride so that the code touches one cache line every 4K or 16K, but the results did not change.

I replaced the YMM loads with regular scalar loads, and the results did not change.

Upvotes: 2

Views: 38
