I discovered that on my Ubuntu 22 server, attempting to write to a file often incurs a delay of around 20 ms, even when writing only a few bytes.
Here is some basic code that demonstrates the problem:
#include <iostream>
#include <fstream>
#include <chrono>
#include <filesystem>
#include <thread>
using namespace std;

void go_c() {
    FILE *out = fopen("hello.txt", "w");
    fputs("hello", out);
    fclose(out);
}

void go_cpp() {
    ofstream out("hello.txt");
    out << "hello" << endl;
}

double test(void (*f)()) {
    typedef chrono::time_point<chrono::steady_clock> tp;
    tp t0 = chrono::steady_clock::now();
    f();
    tp t1 = chrono::steady_clock::now();
    return chrono::duration<double>(t1 - t0).count() * 1000; // milliseconds
}

void bench(void (*f)(), const char *txt, int delay_ms) {
    filesystem::remove("hello.txt");
    for (int i = 0; i < 5; i++) {
        double t = test(f);
        cerr << i << ": " << txt << ", time = " << t << " ms" << endl;
        this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    }
    cerr << endl;
}

int main() {
    bench(go_c, "C Write", 0);
    bench(go_cpp, "C++ Write", 0);
    bench(go_c, "C Write with delay", 2500);
    bench(go_cpp, "C++ Write with delay", 2500);
    return 0;
}
And here is the output:
ubuntu@captain:~/scratch$ g++ -o write3 write3.cpp -O2 -Wall
ubuntu@captain:~/scratch$ ./write3
0: C Write, time = 0.09978 ms
1: C Write, time = 21.9316 ms
2: C Write, time = 0.185957 ms
3: C Write, time = 0.140212 ms
4: C Write, time = 0.139051 ms
0: C++ Write, time = 0.145766 ms
1: C++ Write, time = 0.091845 ms
2: C++ Write, time = 0.139618 ms
3: C++ Write, time = 0.130834 ms
4: C++ Write, time = 0.132217 ms
0: C Write with delay, time = 0.048674 ms
1: C Write with delay, time = 0.23875 ms
2: C Write with delay, time = 20.8626 ms
3: C Write with delay, time = 8.4307 ms
4: C Write with delay, time = 19.4026 ms
0: C++ Write with delay, time = 17.1555 ms
1: C++ Write with delay, time = 17.5887 ms
2: C++ Write with delay, time = 18.9792 ms
3: C++ Write with delay, time = 25.8653 ms
4: C++ Write with delay, time = 20.7998 ms
It seems more likely to happen if there is a bit of delay between attempts, and also more likely to happen if the file already exists.
(It seems I can improve my server performance by deleting a file before I write to it. This seems illogical.)
The hardware is a 6-core Xeon E-2386G with dual NVMe SSDs in software RAID, running almost nothing else and with no other users.
See the attached screenshot showing the output of uname -a and of dpkg --list | grep -E "libc6?-(dev|bin)".
Upvotes: 15
This answer is meant to provide information about the observed behavior and to investigate what is going on.
TL;DR: the problem clearly comes from the active power state of the NVMe device, which can be tuned so as not to pay a huge wake-up overhead. The overhead shows up with a delay because of asynchronous IO operations and deferred waits in the Linux kernel. Read the part about NVMe device power management (at the end) for the fix.
First of all, I can partially reproduce the effect on my machine (Debian with Linux kernel 5.15.0-2, a 6-core Intel i5-9600KF and a "Samsung 970 Evo Plus" NVMe SSD) under certain conditions. More specifically, the ~6 ms latency is visible only on the second C test (the 2: C Write, time = line), and only when I have not just run GCC to generate the binary. All timings are below 1 ms otherwise (generally in the range 0.1-0.4 ms). Updating the kernel to version 5.18.0-2 reduced the latency to 2.5 ms (+/- 0.5 ms). The timings are sufficiently stable and deterministic to be worth reporting.
A quick analysis using the great Linux perf tool shows that a significant portion of the time is spent in kernel scheduling-related functions when the latency spike happens (seen with perf record). Fortunately, we can analyze what the scheduler does with perf too: the command perf sched record ./a.out records scheduling information, and perf sched timehist -Vg displays the schedule (note that the perf command is sometimes suffixed with a version number on some systems, e.g. perf_5.4). Here is an example from my machine:
time cpu 0123456 task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------- ------------------------------ --------- --------- ---------
95730.123845 [0005] i <idle> 0.000 0.000 0.000
95730.123845 [0002] s perf_5.4[55033] 0.000 0.000 0.000
95730.124721 [0000] i <idle> 0.000 0.000 0.000
95730.124725 [0000] s gmain[805/804] 0.000 0.025 0.004
95730.124761 [0000] i <idle> 0.004 0.000 0.036
95730.124765 [0000] s gmain[805/804] 0.036 0.025 0.003
95730.124789 [0003] i <idle> 0.000 0.000 0.000
95730.124795 [0003] s kworker/u12:2-e[45157] 0.000 0.023 0.006
95730.124822 [0002] i <idle> 0.000 0.000 0.976
95730.124833 [0000] i <idle> 0.003 0.000 0.068
95730.124834 [0005] s a.out[55040] 0.000 0.026 0.988
95730.124836 [0000] s | gmain[805/804] 0.068 0.038 0.003
95730.124838 [0002] s | sudo[54745] 0.000 0.028 0.015
95730.124849 [0003] i | <idle> 0.006 0.000 0.053
95730.124854 [0003] s | kworker/u12:2-e[45157] 0.053 0.027 0.004
95730.124886 [0002] i | <idle> 0.015 0.000 0.048
95730.124901 [0002] s | sudo[54745] 0.048 0.033 0.015
95730.124916 [0003] i | <idle> 0.004 0.000 0.062
95730.124922 [0003] s | kworker/u12:2-e[45157] 0.062 0.024 0.005
95730.124945 [0004] i| <idle> 0.000 0.000 0.000
95730.124987 [0004] s| gnome-terminal-[8464] 0.000 0.024 0.042
95730.127461 [0003] i | <idle> 0.005 0.000 2.539
95730.127474 [0005] i <idle> 0.988 0.000 2.639
95730.127475 [0003] s kworker/u12:2-e[45157] 2.539 0.023 0.013
95730.127516 [0000] i | <idle> 0.003 0.000 2.679
95730.127519 [0000] s | gmain[805/804] 2.679 0.027 0.003
95730.127530 [0005] | s a.out[55040] 2.639 0.001 0.056
95730.127537 [0003] i | <idle> 0.013 0.000 0.062
95730.127549 [0005] i <idle> 0.056 0.000 0.018
95730.127550 [0003] s kworker/u12:2-e[45157] 0.062 0.023 0.013
95730.127566 [0004] i <idle> 0.042 0.000 2.578
95730.127568 [0004] s kworker/u12:4-e[54041] 0.000 0.026 0.002
95730.127585 [0002] i <idle> 0.015 0.000 2.683
95730.127585 [0000] i <idle> 0.003 0.000 0.065
95730.127588 [0000] s gmain[805/804] 0.065 0.026 0.003
95730.127591 [0005] s a.out[55040] 0.018 0.001 0.042
95730.127595 [0002] s | sudo[54745] 2.683 0.043 0.009
95730.127605 [0004] i| <idle> 0.002 0.000 0.037
95730.127617 [0005] i <idle> 0.042 0.000 0.026
95730.127618 [0004] s kworker/u12:4-e[54041] 0.037 0.028 0.013
95730.127635 [0003] i <idle> 0.013 0.000 0.085
95730.127637 [0003] s kworker/u12:2-e[45157] 0.085 0.027 0.002
95730.127644 [0003] i <idle> 0.002 0.000 0.007
95730.127647 [0003] s kworker/u12:2-e[45157] 0.007 0.000 0.002
95730.127650 [0003] i <idle> 0.002 0.000 0.003
95730.127652 [0003] s kworker/u12:2-e[45157] 0.003 0.000 0.001
95730.127653 [0003] i <idle> 0.001 0.000 0.001
95730.127659 [0003] s kworker/u12:2-e[45157] 0.001 0.000 0.006
95730.127662 [0002] i <idle> 0.009 0.000 0.067
95730.127662 [0000] i <idle> 0.003 0.000 0.073
95730.127666 [0000] s gmain[805/804] 0.073 0.036 0.003
95730.127669 [0003] i <idle> 0.006 0.000 0.010
95730.127672 [0004] i <idle> 0.013 0.000 0.053
95730.127673 [0003] s kworker/u12:2-e[45157] 0.010 0.000 0.004
95730.127674 [0004] s kworker/u12:4-e[54041] 0.053 0.026 0.002
95730.127676 [0004] i <idle> 0.002 0.000 0.001
95730.127678 [0004] s kworker/u12:4-e[54041] 0.001 0.001 0.002
95730.127679 [0002] s sudo[54745] 0.067 0.052 0.016
95730.127692 [0001] i <idle> 0.000 0.000 0.000
95730.127717 [0001] s gnome-terminal-[8464] 2.704 0.019 0.024
95730.127725 [0005] s a.out[55040] 0.026 0.001 0.107
95730.127755 [0002] i | <idle> 0.016 0.000 0.075
The time on the left is in seconds, and the 0123456 column shows the schedule of the active tasks on the cores: s means the task is scheduled and i means it is interrupted. I added the | symbols to make it easier to see when the tracked process is running (a.out is the program executing your code). The execution time printed by the program is 2.68278 ms, so we are basically looking for a 0.0027-second gap in the timestamps (though I find the idle timestamps unreliable, since they appear to indicate the end of the idle period rather than its beginning, i.e. the moment the task was suspended).
The schedule shows that the process runs for 0.988 ms, is interrupted for 2.639 ms, runs again for 0.056 ms, is interrupted again for 0.018 ms, runs again, and so on (the rest of the execution is not shown for the sake of clarity). The first interruption matches the reported spike timing very well (especially once we include the time for the process to warm up again and for the scheduler to perform the context switch).
During the interruption of the program, two tasks are woken up: a kernel thread called kworker/u12:2-e and a task called gmain, which is certainly gnome-shell. The kernel thread starts when the program is interrupted and is interrupted when the program resumes (with a 7 µs delay). Also note that the kernel thread takes 2.539 ms to run.
Perf can also provide some information about kernel calls, by adding the options --kernel-callchains --call-graph dwarf to sched record. Unfortunately, the results are not very useful in this case. The only useful piece of information is that the kernel call chain io_schedule <- folio_wait_bit_common <- folio_wait_writeback <- truncate_inode_partial_folio is active when the program is interrupted during the spike. This proves the program is interrupted because of an IO operation. You can also add the --wakeups flag to see the wakeup events: the suspiciously slow kernel thread is woken by the target program, while the previous ones are woken by other tasks (typically gmain or gnome-terminal).
strace -T ./a.out can be used to time the system calls, and it clearly shows that the third call to openat is slow on my machine. Here is the interesting part (reformatted for the sake of clarity):
unlink: 0.000072 s
openat: 0.000047 s
newfstatat: 0.000007 s
write: 0.000044 s
close: 0.000006 s
[...] (write x 7)
openat: 0.000019 s
newfstatat: 0.000005 s
write: 0.000011 s
close: 0.000022 s
[...] (write x 7)
openat: 0.002334 s <----- latency spike
newfstatat: 0.000057 s
write: 0.000080 s
close: 0.000052 s
[...] (write x 7)
openat: 0.000021 s
newfstatat: 0.000005 s
write: 0.000029 s
close: 0.000014 s
[...]
Based on the gathered information, we can clearly say that system calls like openat or close do not always cause the program to 1. be interrupted and 2. start a kernel thread that performs the actual operation on the SSD. Instead, IO calls appear to be aggregated/cached in RAM, and the completion/synchronization with the SSD happens only at specific moments. The latency spike occurs only when a kernel thread does the work and the task is interrupted. My guess is that IO operations are performed in RAM (certainly asynchronously) and that the kernel sometimes flushes/syncs the in-RAM data to the SSD, which is what takes a few milliseconds. The reason for such a delay is unclear. In any case, it means the operation is probably latency bound.
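If this interpretation is right, forcing the write-back ourselves should move the latency into the timed code path instead of a later openat. Here is a minimal sketch of such an experiment (my addition, not part of the original benchmark), using the POSIX open/write/fsync calls:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <chrono>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    int fd = open("hello.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "hello", 5) != 5) perror("write");
    fsync(fd);  // block until the data actually reaches the device
    close(fd);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("synced write: %f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}

If the spike really is a deferred device synchronization, this fsync-based version should pay the multi-millisecond cost on nearly every run rather than sporadically.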
To understand exactly what is going on, one unfortunately needs to trace kernel threads and possibly even the SSD driver stack, which is a bit tricky. The simplest way to do that seems to be the Linux Function Tracer (aka ftrace). Note that tracing all kernel functions is very expensive and hides the cost of the expensive ones, so the granularity should be adjusted. Kernel traces quickly become monstrously big, and function names are often not very helpful. On top of that, tracing kernel threads is not easy, because their pids are unknown before the request is made and the operations run in a multi-threaded context (concurrently on each core, due to context switches).
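For reference, here is a rough sketch of how the tracing window can be narrowed from the program itself, by toggling ftrace through its tracefs interface (the /sys/kernel/tracing path and the availability of the function_graph tracer depend on the kernel configuration, and root privileges are required):

#include <cstdio>
#include <fstream>
#include <string>

// Write a plain-text value into a tracefs control file.
static void trace_write(const std::string &file, const char *value) {
    std::ofstream f("/sys/kernel/tracing/" + file);
    f << value;
}

int main() {
    trace_write("current_tracer", "function_graph");  // trace kernel call graphs
    trace_write("tracing_on", "1");                   // start tracing...

    FILE *out = std::fopen("hello.txt", "w");         // ...around the interesting call
    std::fputs("hello", out);
    std::fclose(out);

    trace_write("tracing_on", "0");                   // ...and stop right after
    // The captured trace can then be read from /sys/kernel/tracing/trace.
    return 0;
}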
I tried it on my machine, and here is a simplified profiling trace of the program (showing only calls >= 1 µs and no IRQ/fault kernel calls, for the sake of clarity):
__x64_sys_unlink(); 92.569 us
__x64_sys_openat(); 48.103 us
__x64_sys_newfstatat(); 2.609 us
__x64_sys_write(); 33.606 us
exit_to_user_mode_prepare(); 12.517 us
__x64_sys_write(); 8.277 us
__x64_sys_write(); 2.482 us
__x64_sys_write(); 2.257 us
__x64_sys_write(); 2.240 us
__x64_sys_write(); 5.987 us
__x64_sys_write(); 5.090 us
__x64_sys_openat(); 77.006 us <------ Fast
__x64_sys_newfstatat(); 2.433 us
__x64_sys_write(); 43.509 us
exit_to_user_mode_prepare(); 83.260 us
__x64_sys_write(); 5.688 us
__x64_sys_write(); 6.339 us
__x64_sys_write(); 4.521 us
__x64_sys_write(); 3.006 us
__x64_sys_write(); 4.309 us
__x64_sys_write(); 3.472 us
__x64_sys_write(); 2.669 us
__x64_sys_openat() {
[CONTEXT SWITCH: a.out-73884 => <idle>-0]
[MISSING PART: KERNEL THREAD]
[CONTEXT SWITCH: <idle>-0 => a.out-73884]
} 2441.794 us <----- Latency spike
__x64_sys_newfstatat(); 3.007 us
__x64_sys_write(); 74.643 us
exit_to_user_mode_prepare(); 64.822 us
__x64_sys_write(); 24.032 us
__x64_sys_write(); 3.002 us
__x64_sys_write(); 2.408 us
__x64_sys_write(); 4.181 us
__x64_sys_write(); 3.662 us
__x64_sys_write(); 2.381 us
__x64_sys_write(); 23.284 us
__x64_sys_openat(); 79.258 us
__x64_sys_newfstatat(); 27.363 us
__x64_sys_write(); 45.040 us
[...]
The kernel trace proves that a context switch happens during an __x64_sys_openat call (i.e. the syscall made by fopen) and that the latency spike occurs at that same moment.
Deeper tracing shows the function responsible for the wait:
__x64_sys_openat
do_sys_openat2
do_filp_open
path_openat
do_truncate
notify_change
ext4_setattr
truncate_pagecache
truncate_inode_pages_range
truncate_inode_partial_folio
folio_wait_writeback
folio_wait_bit
io_schedule
schedule
[task interruption] <---- takes ~95% of the time
Meanwhile, the first call to openat calls truncate_inode_pages_range but not truncate_inode_partial_folio, so there is no task interruption and no kernel thread needed to complete the operation. In fact, every call to openat on "hello.txt" causes truncate_inode_pages_range to be called, but only two calls to truncate_inode_partial_folio are made over the first 5 calls to fopen. That function always ends up calling schedule in practice, but only the first call is slow (subsequent calls take 20-700 µs, with an average of 30 µs). The truncate_pagecache function tends to confirm that there is a cache, but it does not explain why subsequent calls to schedule are faster.
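This truncation path also suggests why the question's trick of deleting the file first "improves performance": fopen(..., "w") on an existing file truncates it and may have to wait for a pending write-back, while unlinking first lets the open create a fresh file with nothing to wait for. Here is a small sketch comparing the two paths (my own experiment; reproducing the stall still requires the device to have gone idle in between):

#include <chrono>
#include <cstdio>
#include <iostream>
#include <unistd.h>

static double time_ms(void (*f)()) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

static void write_hello() {
    FILE *out = std::fopen("hello.txt", "w");  // truncates when the file exists
    std::fputs("hello", out);
    std::fclose(out);
}

static void unlink_then_write() {
    unlink("hello.txt");  // remove first, so the open creates a fresh file
    write_hello();
}

int main() {
    write_hello();  // ensure the file exists with freshly written data
    std::cout << "truncate path: " << time_ms(write_hello) << " ms\n";
    write_hello();
    std::cout << "unlink path:   " << time_ms(unlink_then_write) << " ms\n";
    return 0;
}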
When tracing the kernel threads, I ended up with traces like:
finish_task_switch.isra.0: 0.615 us
preempt_count_sub: 0.111 us
wq_worker_running: 0.246 us
_raw_spin_lock_irq: 0.189 us
process_one_work: 24.092 us <----- Actual kernel thread computation
This basically shows that the most important part (>95%) of the time is missing from the profiling traces. Unfortunately, tracing functions as above (as well as using eBPF tools like ext4slower-bpfcc) resulted in very inconsistent timings, mainly because of the way the time is measured (the absolute time is missing). One needs to trace the IO-based subsystems to understand this further.
Event profiling of the lower-level NVMe stack (using the ftrace-based trace-cmd tool) shows that the first NVMe request is slow while subsequent ones are fast:
nvme_setup_cmd: 173732.202096 <----- First request started
nvme_sq: 173732.204539 <----- ~2.5 ms delay
nvme_complete_rq: 173732.204543
nvme_setup_cmd: 173732.204873 <----- Second request started
nvme_sq: 173732.204892 <----- No delay
nvme_complete_rq: 173732.204894
nvme_setup_cmd: 173732.205240
nvme_sq: 173732.205257 <----- Same here
nvme_complete_rq: 173732.205259
[...]
The fact that the second openat call is slow is probably due to synchronization (the IO scheduler waits for the previous request to complete). The most probable reason is that the NVMe device enters a sleep mode when no request has been sent for a relatively long time, and that it takes time to wake up.
These sleep states are called Active Power States. They can be listed with the command smartctl -a /dev/nvme0 (from the smartmontools package):
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.80W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 3.40W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0100W - - 4 4 4 4 2000 8000
The latency of states 3 and 4 is quite large, but their power consumption is also far lower than that of the others. This can be controlled using the nvme command (from the nvme-cli package), more specifically its get-feature and set-feature sub-commands. You can get more information about this here. In my case, I simply wrote 1000 into the file /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us so that the latency cannot be bigger than 1 ms (this requires root privileges). Note that this file is reset when the machine reboots.
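For what it is worth, the same one-shot fix can be applied from C++ too. A minimal sketch, assuming the device is nvme0 and the process runs as root:

#include <fstream>
#include <iostream>

int main() {
    // Cap the tolerated wake-up latency at 1000 µs (1 ms); reset at reboot.
    std::ofstream qos("/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us");
    if (!qos) {
        std::cerr << "cannot open the sysfs file (not root? no nvme0?)\n";
        return 1;
    }
    qos << 1000 << std::endl;
    return qos ? 0 : 1;
}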
WARNING: note that preventing the SSD from switching to its deep sleep modes can reduce battery life on notebooks and increase the device temperature. In pathological cases (e.g. a poor NVMe controller), this could damage the device if it is not properly cooled. That being said, most devices are protected against this by a throttling strategy.
Once the power QoS has been modified, the latency spike is gone. Et voilà! I get the following program output:
0: C Write, time = 0.289867 ms
1: C Write, time = 0.285233 ms
2: C Write, time = 0.225163 ms
3: C Write, time = 0.222544 ms
4: C Write, time = 0.212254 ms
[...]
Note that this explains why the latency is not the same from one machine to another (and likewise the idle time needed to enter sleep mode), and why running GCC just beforehand made the spike disappear.
Upvotes: 14
It should be easy to confirm that the delay is in the kernel with strace. If so, you can turn to tools like perf top and sysdig to identify the cause.
Upvotes: 1
Your issue with the C++ implementation is likely due to std::endl, which emits not only \n but also a std::flush, which forces the OS to write to the peripheral. However, I could not reproduce your issue on my machine, so maybe it is not code related.
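For illustration, a sketch of the suggested change: writing '\n' instead of std::endl avoids the explicit flush and leaves it to the ofstream destructor:

#include <fstream>

void go_cpp() {
    std::ofstream out("hello.txt");
    out << "hello" << '\n';  // no explicit std::flush; flushed when the stream closes
}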
Upvotes: -1