Reputation: 795
Before I begin, I was in two minds as to whether this question should be raised on SuperUser or Stack Overflow - apologies in advance if it's in the wrong place.
I have a docker container (contains C/C++ executable code) which performs audio/video processing. As a result, I would like to test the benefits of running the container with RT scheduling constraints. Searching the web, I've come across various bits of information, but I'm struggling to put all the pieces together.
System Environment:
In an executable nested in the docker image/container, the following code is executed to change the scheduler from 'SCHED_OTHER' to 'SCHED_FIFO' (see docs):
struct sched_param sched = {};
const int nMin = sched_get_priority_min(SCHED_FIFO);
const int nMax = sched_get_priority_max(SCHED_FIFO);
const int nHlf = (nMax - nMin) / 2;
const int nPriority = nMin + nHlf + 1;
sched.sched_priority = boost::algorithm::clamp(nPriority, nMin, nMax);
if (sched_setscheduler(0, SCHED_FIFO, &sched) < 0)
    std::cerr << "SETSCHEDULER failed - err = " << strerror(errno) << std::endl;
else
    std::cout << "Priority set to \"" << sched.sched_priority << "\"" << std::endl;
I've been reading various bits of Docker documentation on using a realtime scheduler. One interesting page states,
Verify that CONFIG_RT_GROUP_SCHED is enabled in the Linux kernel by running zcat /proc/config.gz | grep CONFIG_RT_GROUP_SCHED or by checking for the existence of the file /sys/fs/cgroup/cpu.rt_runtime_us. For guidance on configuring the kernel realtime scheduler, consult the documentation for your operating system.
As per the aforementioned recommendation, the stock Ubuntu Zesty 17.04 OS seems to fail these checks.
First question(s): Can I not use the RT scheduler? What is 'CONFIG_RT_GROUP_SCHED'? One thing that confuses me is that there are some older posts on the web from 2010-2012 about patching kernels with an RT patch. It seems that there has been some work in the Linux kernel related to soft RT since then.
The quote here has sparked my question:
From kernel version 2.6.18 onward, however, Linux is gradually becoming equipped with real-time capabilities, most of which are derived from the former realtime-preempt patches developed by Ingo Molnar, Thomas Gleixner, Steven Rostedt, and others. Until the patches have been completely merged into the mainline kernel (this is expected to be around kernel version 2.6.30), they must be installed to achieve the best real-time performance. These patches are named:
Carrying on...
Having read additional information, I note that it is important to set ulimits. I've altered /etc/security/limits.conf:
#* soft core 0
#root hard core 100000
#* hard rss 10000
# NEW ADDITION
gavin hard rtprio 99
Second question: Presumably the above is required to enable the docker daemon to run RT? It looks as if the daemon is controlled via systemd.
I continued further with my investigation and on the same Docker docs page saw the following snippet:
To run containers using the realtime scheduler, run the Docker daemon with the --cpu-rt-runtime flag set to the maximum number of microseconds reserved for realtime tasks per runtime period. For instance, with the default period of 1000000 microseconds (1 second), setting --cpu-rt-runtime=950000 ensures that containers using the realtime scheduler can run for 950000 microseconds for every 1000000-microsecond period, leaving at least 50000 microseconds available for non-realtime tasks. To make this configuration permanent on systems which use systemd, see Control and configure Docker with systemd.
Following this page, I discovered there were two parameters to the daemon that were of interest:
--cpu-rt-period int Limit the CPU real-time period in microseconds
--cpu-rt-runtime int Limit the CPU real-time runtime in microseconds
The same page indicates that docker daemon parameters can be specified via '/etc/docker/daemon.json', so I tried:
{
"cpu-rt-period": 92500,
"cpu-rt-runtime": 100000
}
Note: The docs do not specify the above options as 'allowed configuration options on Linux'. I thought I would give it a try nonetheless.
Docker daemon output upon restart:
-- Logs begin at Wed 2017-10-04 09:58:38 BST, end at Wed 2017-10-04 10:01:32 BST. --
Oct 04 09:58:47 gavin systemd[1]: Starting Docker Application Container Engine...
Oct 04 09:58:47 gavin dockerd[1501]: time="2017-10-04T09:58:47.885882588+01:00" level=info msg="libcontainerd: new containerd process, pid: 1531"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.053986072+01:00" level=warning msg="failed to rename /var/lib/docker/tmp for background deletion: %!s(<nil>).
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.161303803+01:00" level=info msg="[graphdriver] using prior storage driver: aufs"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.303409053+01:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.304002725+01:00" level=warning msg="Your kernel does not support swap memory limit"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.304078792+01:00" level=warning msg="Your kernel does not support cgroup rt period"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.304201239+01:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.305534113+01:00" level=info msg="Loading containers: start."
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.730193030+01:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemo
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.784938130+01:00" level=info msg="Loading containers: done."
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.888035017+01:00" level=info msg="Daemon has completed initialization"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.888104120+01:00" level=info msg="Docker daemon" commit=89658be graphdriver=aufs version=17.05.0-ce
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.903280645+01:00" level=info msg="API listen on /var/run/docker.sock"
Oct 04 09:58:48 gavin systemd[1]: Started Docker Application Container Engine.
The particular lines of interest:
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.304078792+01:00" level=warning msg="Your kernel does not support cgroup rt period"
Oct 04 09:58:48 gavin dockerd[1501]: time="2017-10-04T09:58:48.304201239+01:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Not surprising given my earlier discoveries.
Final question: When this is finally working, how will I be able to determine that my container is truly running with RT scheduling? Will the likes of 'top' suffice?
EDIT: I ran a kernel diagnostic script which I found through moby on github. This is the output:
warning: /proc/config.gz does not exist, searching other paths for kernel config ...
info: reading kernel config from /boot/config-4.10.0-35-generic ...
Generally Necessary:
- cgroup hierarchy: properly mounted [/sys/fs/cgroup]
- apparmor: enabled and tools installed
- CONFIG_NAMESPACES: enabled
- CONFIG_NET_NS: enabled
- CONFIG_PID_NS: enabled
- CONFIG_IPC_NS: enabled
- CONFIG_UTS_NS: enabled
- CONFIG_CGROUPS: enabled
- CONFIG_CGROUP_CPUACCT: enabled
- CONFIG_CGROUP_DEVICE: enabled
- CONFIG_CGROUP_FREEZER: enabled
- CONFIG_CGROUP_SCHED: enabled
- CONFIG_CPUSETS: enabled
- CONFIG_MEMCG: enabled
- CONFIG_KEYS: enabled
- CONFIG_VETH: enabled (as module)
- CONFIG_BRIDGE: enabled (as module)
- CONFIG_BRIDGE_NETFILTER: enabled (as module)
- CONFIG_NF_NAT_IPV4: enabled (as module)
- CONFIG_IP_NF_FILTER: enabled (as module)
- CONFIG_IP_NF_TARGET_MASQUERADE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_IPVS: enabled (as module)
- CONFIG_IP_NF_NAT: enabled (as module)
- CONFIG_NF_NAT: enabled (as module)
- CONFIG_NF_NAT_NEEDED: enabled
- CONFIG_POSIX_MQUEUE: enabled
Optional Features:
- CONFIG_USER_NS: enabled
- CONFIG_SECCOMP: enabled
- CONFIG_CGROUP_PIDS: enabled
- CONFIG_MEMCG_SWAP: enabled
- CONFIG_MEMCG_SWAP_ENABLED: missing
(cgroup swap accounting is currently not enabled, you can enable it by setting boot option "swapaccount=1")
- CONFIG_LEGACY_VSYSCALL_EMULATE: enabled
- CONFIG_BLK_CGROUP: enabled
- CONFIG_BLK_DEV_THROTTLING: enabled
- CONFIG_IOSCHED_CFQ: enabled
- CONFIG_CFQ_GROUP_IOSCHED: enabled
- CONFIG_CGROUP_PERF: enabled
- CONFIG_CGROUP_HUGETLB: enabled
- CONFIG_NET_CLS_CGROUP: enabled (as module)
- CONFIG_CGROUP_NET_PRIO: enabled
- CONFIG_CFS_BANDWIDTH: enabled
- CONFIG_FAIR_GROUP_SCHED: enabled
- CONFIG_RT_GROUP_SCHED: missing
- CONFIG_IP_VS: enabled (as module)
- CONFIG_IP_VS_NFCT: enabled
- CONFIG_IP_VS_RR: enabled (as module)
- CONFIG_EXT4_FS: enabled
- CONFIG_EXT4_FS_POSIX_ACL: enabled
- CONFIG_EXT4_FS_SECURITY: enabled
- Network Drivers:
- "overlay":
- CONFIG_VXLAN: enabled (as module)
Optional (for encrypted networks):
- CONFIG_CRYPTO: enabled
- CONFIG_CRYPTO_AEAD: enabled
- CONFIG_CRYPTO_GCM: enabled (as module)
- CONFIG_CRYPTO_SEQIV: enabled
- CONFIG_CRYPTO_GHASH: enabled (as module)
- CONFIG_XFRM: enabled
- CONFIG_XFRM_USER: enabled (as module)
- CONFIG_XFRM_ALGO: enabled (as module)
- CONFIG_INET_ESP: enabled (as module)
- CONFIG_INET_XFRM_MODE_TRANSPORT: enabled (as module)
- "ipvlan":
- CONFIG_IPVLAN: enabled (as module)
- "macvlan":
- CONFIG_MACVLAN: enabled (as module)
- CONFIG_DUMMY: enabled (as module)
- "ftp,tftp client in container":
- CONFIG_NF_NAT_FTP: enabled (as module)
- CONFIG_NF_CONNTRACK_FTP: enabled (as module)
- CONFIG_NF_NAT_TFTP: enabled (as module)
- CONFIG_NF_CONNTRACK_TFTP: enabled (as module)
- Storage Drivers:
- "aufs":
- CONFIG_AUFS_FS: enabled (as module)
- "btrfs":
- CONFIG_BTRFS_FS: enabled (as module)
- CONFIG_BTRFS_FS_POSIX_ACL: enabled
- "devicemapper":
- CONFIG_BLK_DEV_DM: enabled
- CONFIG_DM_THIN_PROVISIONING: enabled (as module)
- "overlay":
- CONFIG_OVERLAY_FS: enabled (as module)
- "zfs":
- /dev/zfs: missing
- zfs command: missing
- zpool command: missing
Limits:
- /proc/sys/kernel/keys/root_maxkeys: 1000000
Line of significance:
- CONFIG_RT_GROUP_SCHED: missing
Upvotes: 9
Views: 21188
Reputation: 2564
The solution given by Guy de Carufel did not work for me. What I had to do (after first compiling my kernel with real-time group scheduling support for control groups, CONFIG_RT_GROUP_SCHED) was to stop the Docker daemon with the following commands:
$ sudo systemctl stop docker
$ sudo systemctl stop docker.socket
Then I could restart the daemon, assigning its control group a large time slice (such as 950000):
$ sudo dockerd --cpu-rt-runtime=950000
These changes to the Docker daemon can be made permanent by configuring it as described here and here.
Then finally I could launch my container with the real-time scheduler as follows:
$ sudo docker run -it --cpu-rt-runtime=950000 --ulimit rtprio=99 ubuntu:20.04
In a Docker-Compose file you can achieve this with the following settings (as pointed out in the following documentation: 1, 2, 3):
cpu_rt_runtime: 950000
ulimits:
  rtprio: 99
Additionally launching the container as privileged and with net=host helps reduce overhead, as discussed here and in this post.
The real-time runtime allocated to each control group, cpu.rt_runtime_us, can be inspected in the /sys/fs/cgroup/cpu,cpuacct folder. If you have already allocated a large portion of real-time runtime to another cgroup, this might result in the error message "failed to write 95000 to cpu.rt_runtime_us: write /sys/fs/cgroup/cpu,cpuacct/system.slice/.../cpu.rt_runtime_us: invalid argument" or similar (see here and here). For more details on control groups in general, see the corresponding official documentation (4, 5).
For real-time processes inside a Docker container I found the alternative to control groups, the PREEMPT_RT patch, far more useful: you can install it easily from a Debian package, and it is then sufficient to run the container with the privileged option in order to set real-time priorities for processes inside it. The main advantage is a significantly lower maximum latency compared to control groups. I have discussed this in more detail in this post and created a GitHub repository with guides and scripts that help with the installation of PREEMPT_RT.
Upvotes: 0
Reputation: 476
There are two options to do RT scheduling within a container:
1. The SYS_NICE capability: docker run --cap-add SYS_NICE ...
2. The --privileged flag: docker run --privileged ...
NOTE: The --privileged flag grants more permissions than necessary! The more limited --cap-add SYS_NICE option is much safer.
You may also have to enable real-time scheduling in your sysctl. If you are running as the root user (default for Docker container):
sysctl -w kernel.sched_rt_runtime_us=-1
To make that permanent (update your image):
echo 'kernel.sched_rt_runtime_us=-1' >> /etc/sysctl.conf
https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities
Upvotes: 11