dangzzz

Reputation: 199

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?

In this paper (the FAST '20 Optane characterization study referenced in the answers below), the 8-byte sequential write latency of Optane PM is reported as 90 ns for clwb and 62 ns for ntstore, and the sequential read latency as 169 ns.

But in my test on an Intel Xeon Gold 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. Of course my test method differs from the paper's, but the results are so much worse that they seem unreasonable, and my test is closer to actual usage.

During the test, did the Write Pending Queue (WPQ) of the CPU's iMC or the write-combining buffer inside the Optane PM become the bottleneck and stall the writes, making the measured latency inaccurate? If this is the case, is there a tool to detect it?

#include <libpmem.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include <x86intrin.h>

// gcc aep_test.c -o aep_test -O3 -mclwb -lpmem

int main()
{
    size_t mapped_len;
    char str[32];
    int is_pmem;
    sprintf(str, "/mnt/pmem/pmmap_file_1");
    int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (p == NULL)
    {
        printf("map file fail!\n");
        exit(1);
    }
    if (!is_pmem)
    {
        printf("mapped file is not pmem!\n");
        exit(1);
    }

    struct timeval start;
    struct timeval end;
    unsigned long diff;
    int loop_num = 10000;

    _mm_mfence();
    gettimeofday(&start, NULL);

    for (int i = 0; i < loop_num; i++)
    {
        // 8-byte store; eight consecutive int64_t slots share one 64-byte cache line
        p[i] = 0x2222;
        _mm_clwb(p + i);
        // _mm_stream_si64(p + i, 0x2222);   // ntstore variant of the same test
        _mm_sfence();
    }

    gettimeofday(&end, NULL);

    diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;

    printf("Total time is %ld us\n", diff);
    printf("Latency is %ld ns\n", diff * 1000 / loop_num);

    return 0;
}

Any help or correction is much appreciated!

Upvotes: 1

Views: 685

Answers (2)

grayxu

Reputation: 144

  1. The main reason is that repeatedly flushing the same cache line is delayed dramatically [1] (see the sketch after this list).
  2. You are measuring the average latency, not the best-case latency that the FAST '20 paper reports.
  3. ntstore is more expensive than clwb, so its latency is higher; I guess there is a typo in your first paragraph.
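
As a rough illustration of point 1: the loop in your question stores eight consecutive int64_t values into each 64-byte line, so the same cache line gets flushed eight times. Below is a minimal sketch (not your exact benchmark; the file path and sizes are placeholders) that advances by a full cache line per iteration, so every clwb hits a fresh line:

#include <libpmem.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include <x86intrin.h>

#define CACHELINE 64

// gcc stride_test.c -o stride_test -O3 -mclwb -lpmem
int main(void)
{
    size_t mapped_len;
    int is_pmem;
    // placeholder path; point it at your own pmem mount
    char *pm = pmem_map_file("/mnt/pmem/pmmap_file_stride", 4096 * 1024 * 128,
                             PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (pm == NULL || !is_pmem) { printf("map file fail!\n"); exit(1); }

    int loop_num = 10000;
    struct timeval start, end;

    _mm_mfence();
    gettimeofday(&start, NULL);
    for (int i = 0; i < loop_num; i++)
    {
        // one 8-byte store per iteration, each landing on a different cache line
        *(volatile int64_t *)(pm + (size_t)i * CACHELINE) = 0x2222;
        _mm_clwb(pm + (size_t)i * CACHELINE);
        _mm_sfence();
    }
    gettimeofday(&end, NULL);

    long diff = 1000000L * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    printf("avg per store+clwb+sfence: %ld ns\n", diff * 1000 / loop_num);
    return 0;
}

Comparing this against the original same-line version should show whether repeated flushes of one line are what is inflating your numbers.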

Appended on 4.14:

Q: Are there tools to detect a possible bottleneck in the WPQ or the buffers?
A: You can record a baseline while the PM is idle, and compare your measurements against that baseline to spot a possible bottleneck.
Tools:

  1. Intel Memory Bandwidth Monitoring
  2. Read two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which accumulates the number of WPQ entries occupied at each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ as UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2] One way to read these counters is sketched below.
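
For example, a possible way to read them with Linux perf, assuming your perf build exposes the Cascade Lake iMC uncore events under these names (the event names here are an assumption; verify first with perf list | grep -i pmm_wpq):

# uncore events are system-wide, so -a (and usually root) is required
sudo perf stat -a \
    -e unc_m_pmm_wpq_inserts \
    -e unc_m_pmm_wpq_occupancy.all \
    -- ./aep_test
# queueing delay (in iMC cycles) ≈ unc_m_pmm_wpq_occupancy.all / unc_m_pmm_wpq_inserts

Run it once while the PM is idle (baseline) and once under your benchmark; a large jump in the ratio suggests the WPQ is backing up.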

[1] Chen, Youmin, et al. "FlatStore: An Efficient Log-Structured Key-Value Storage Engine for Persistent Memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. "The Analysis of Inter-process Interference on a Hybrid Memory System." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.

Upvotes: 3

Peter Cordes

Reputation: 364220

https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1 below). So it's the CPU-pipeline latency of getting a store "accepted" into something persistent.

This isn't the same thing as the data making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controller is part of the persistence domain on Cascade Lake Intel CPUs like yours; WikiChip quotes an Intel diagram of this:

[Intel diagram via WikiChip: the persistence domain on Cascade Lake, including the memory controller's WPQ]

Footnote 1: Also note that clwb on Cascade Lake works like clflushopt: it just evicts the line. So a store + clwb + mfence loop would be testing the cache-cold case, unless you do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully support clwb properly (writing back without evicting), but at least Cascade Lake got the instruction supported, so future libraries won't have to check CPU features before using it.


You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring throughput of a loop, not latency of one store plus mfence itself in a previously-idle CPU pipeline.
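
If you want something closer to what the paper reports, one option is to time a single operation on an already-cached line with the pipeline otherwise idle. A minimal sketch, assuming 'slot' points into your pmem mapping (e.g. &p[0] from your program); convert TSC ticks to nanoseconds with your own base-clock frequency and take the minimum over many trials:

#include <stdint.h>
#include <x86intrin.h>

// Time one store + clwb + sfence on a cache-hot line, in TSC ticks.
// 'slot' is assumed to point into a pmem mapping (e.g. p from the question).
static inline uint64_t time_one_persist(volatile int64_t *slot)
{
    unsigned aux;

    *slot = 0x1111;                 // warm the cache line first (cache-hot case)
    _mm_mfence();                   // drain earlier stores so queues start empty

    uint64_t t0 = __rdtscp(&aux);   // timestamp before (waits for prior instructions)
    *slot = 0x2222;                 // the store being measured
    _mm_clwb((void *)slot);         // request write-back toward the WPQ
    _mm_sfence();                   // order the flush before the end timestamp
    uint64_t t1 = __rdtscp(&aux);   // timestamp after

    return t1 - t0;                 // elapsed ticks for one store + clwb + sfence
}

Averaging this over a tight loop would put you right back in the throughput-measuring regime, so call it with idle time (or unrelated work) between samples and report the minimum.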

Separate from that, rewriting the same line repeatedly seems to be slower than sequential writes, for example. This Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, BTW.)

Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.


During the test, did the Write Pending Queue (WPQ) of the CPU's iMC or the write-combining buffer inside the Optane PM become the bottleneck and stall the writes, making the measured latency inaccurate?

Yes, that would be my guess.

If this is the case, is there a tool to detect it?

I don't know, sorry.

Upvotes: 2
