FIO latency percentile changes over time

Question

I want to measure and plot the latency percentile change over time for an SSD. If anyone did something similar please share any advice you might have. I am interested in both how to run FIO and how to process the results.

I will describe first the testing methodology I want to use, then describe what I have done so far (and works imperfectly), and finally ask a couple of questions.

Goal:

I want to keep track of latency average and 95%, 99%, 99.9% latency percentiles over time. Obviously, these measures are implicitly defined over a time window that I would like to be able to set to something like 10-60s intervals.
I want to compare how these latency percentiles change as I vary the IO pattern at a constant device load. I need to be able to control the total load (the amount of data send to the device) to make sure that the percentiles are actually comparable. A simple example would be: a) have a single thread that writes sequentially 200MB/s vs. b) 2 threads that write 100MB/s. It would be meaningless to compare percentiles if the total throughput between the two experiments were different.

What I tried so far:

Custom version of FIO to increase the resolution of the latency histograms. This is probably not needed.
I turned on json+ output so that I get the nice latency histograms. However, these histograms aggregate the whole FIO run so I have no way to measure the latency change over time.
To get the latency change over time, I thought of starting many small FIO jobs one after another. For example, if I want to get the latency over 1h, I start 120 FIO runs of 30s and save each output to a different file. Each output would give me the latency percentiles over 30s. However, there are two problems with this approach:
1. There is a long time required for the FIO startup (about 15-20s) and these allows the SSD to perform GC and repair write performance.
2. For sequential writes, the writing offset is reset at the start of each FIO job. This means that the new FIO run does not actually continue writing sequentially and, even worse, some portions of the device might not be written at all.

Questions:

Is there a method to use FIO to keep track of latency changes over time. If so, could you please provide an example?
For sequential writes, how could I increase throughput? By default, FIO for sequential writes uses iodepth 1 (queue depth 1). I don't see a clear way of increasing throughput over that. Increasing the iodepth does not seem to help.
I saw there are some python scripts in the FIO git repo for plotting. Would any of these be useful? Could anyone point me to some example that resembles what I want to do?

FIO latency percentile changes over time

Answers (1)

Related Questions