Reputation: 51
Recently I have been doing some forwarding tests with the DPDK "testpmd" application, and I found something interesting.
When 512 descriptors are used for TX and RX, the performance is better than with 4096 descriptors. After checking the counters with perf, I found a huge number of "dTLB-load-misses": more than 100 times the count seen with 512 descriptors, while the page-faults counter stays at zero. Judging by the ":u" and ":k" modifiers, most of the TLB misses happen in user space. All the buffers that store the network payload data sit in a single 512MB huge page, each buffer is less than 3KB, and buffers and descriptors are mapped one-to-one.
So is there any way to track down the source of this huge number of TLB misses? And could they be the cause of the performance degradation?
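For reference, this is roughly how I collected the counters (a sketch, not the exact commands; the event names depend on the CPU/PMU, and `$(pidof testpmd)` is just a placeholder for the testpmd process ID):

```sh
# Overall counts while testpmd is forwarding:
perf stat -e dTLB-loads,dTLB-load-misses,page-faults -p $(pidof testpmd) -- sleep 10

# Same events split between user and kernel space with the :u/:k modifiers:
perf stat -e dTLB-load-misses:u,dTLB-load-misses:k -p $(pidof testpmd) -- sleep 10

# Sample the misses to see which code they come from:
perf record -e dTLB-load-misses -p $(pidof testpmd) -- sleep 10
perf report
```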
Upvotes: 3
Views: 675
Reputation: 8544
In general, CPU TLB capacity depends on the page size, so the L1/L2 TLBs may have a different number of entries for 4KB pages than for 512MB pages.
For example, for ARM Cortex-A75:
The data micro TLB is a 48-entry fully associative TLB that is used by load and store operations. The cache entries have 4KB, 16KB, 64KB, and 1MB granularity of VA to PA mappings only.
Source: ARM Info Center
For ARM Cortex-A55:
The Cortex-A55 L1 data TLB supports 4KB pages only. Any other page sizes are fractured after the L2 TLB and the appropriate page size sent to the L1 TLB.
Source: ARM Info Center
Basically, this means that a 512MB huge page mapping will be fractured into smaller pieces (down to 4KB), and only those small pieces will be cached in the L1 dTLB.
So even if your application fits into a single 512MB huge page, performance will still depend greatly on the actual memory footprint, as the sketch below illustrates.
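For a rough sense of scale, here is a back-of-envelope sketch using the numbers from the question (512 vs 4096 descriptors, buffers under 3KB) and assuming the fractured mapping is cached at 4KB granularity, as in the quotes above; the exact granule and TLB sizes on your CPU will differ:

```sh
#!/bin/sh
# Estimate how many 4KB TLB entries the buffer working set needs,
# assuming the 512MB huge page is fractured into 4KB pieces.
BUF_BYTES=3072    # <3KB per buffer, from the question
for DESCS in 512 4096; do
    PAGES=$(( (DESCS * BUF_BYTES + 4095) / 4096 ))
    echo "$DESCS descriptors -> $((DESCS * BUF_BYTES / 1024))KB footprint -> $PAGES 4KB TLB entries"
done
```

With 512 descriptors the working set is about 1.5MB (384 fractured 4KB entries), while with 4096 descriptors it grows to about 12MB (3072 entries). Both exceed a 48-entry micro TLB like the Cortex-A75's, but a main/L2 TLB (often on the order of a thousand entries on modern cores) may still cover the smaller working set and not the larger one, which could explain the jump in dTLB-load-misses you observed.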
Upvotes: 3