Reputation: 563
I was trying to run my pod as non root and also grant it some capabilities.
This is my config:
containers:
- name: container-name
  securityContext:
    capabilities:
      add: ["SETUID", "SYS_TIME"]
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 1001
When I deploy my pod and connect to it, I run ps aux and see:
PID USER TIME COMMAND
1 root 0:32 node bla.js
205 root 0:00 /bin/bash
212 root 0:00 ps aux
I then run cat /proc/1/status and see:
CapPrm: 0000000000000000
CapEff: 0000000000000000
Which means I have no capabilities for this container's process.
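For anyone who wants to script this check, here is a minimal sketch in Go that pulls the Cap* fields out of the status text and reports whether the permitted and effective sets are empty. The helper name parseCapFields is made up for illustration, and the sample string simply mirrors the two lines shown above; on a real system you would read /proc/1/status instead.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCapFields (hypothetical helper) extracts the Cap* lines from the text
// of /proc/<pid>/status and returns them as bitmasks keyed by field name.
func parseCapFields(status string) (map[string]uint64, error) {
	fields := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		name, value, ok := strings.Cut(sc.Text(), ":")
		if !ok || !strings.HasPrefix(name, "Cap") {
			continue // not a capability line
		}
		mask, err := strconv.ParseUint(strings.TrimSpace(value), 16, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %s: %w", name, err)
		}
		fields[name] = mask
	}
	return fields, sc.Err()
}

func main() {
	// Sample input mirroring the output above; replace it with the contents
	// of /proc/1/status (for example via os.ReadFile) on a real system.
	sample := "CapPrm:\t0000000000000000\nCapEff:\t0000000000000000\n"
	fields, err := parseCapFields(sample)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"CapPrm", "CapEff"} {
		fmt.Printf("%s empty: %v\n", name, fields[name] == 0)
	}
}
```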
The thing is that if I remove the runAsNonRoot: true
flag from the securityContext
I can see I do have multiple capabilities.
Is there a way to run a pod as a non-root and still add some capabilities?
Upvotes: 5
Views: 3586
Reputation: 185
The accepted answer is only part of the story. You can achieve what you are looking for, but you also need to set the capabilities on the file you are executing.
You'll want to read the Linux capabilities(7) manual. It describes two sources of capability sets that combine to form your final capability sets: the capabilities of "the thread" before execve() is called, and the file capabilities. The final capability sets are basically the intersection of these two sources, as per the "Transformation of capabilities during execve()" section. When execve() runs a program as root, the rules are a bit different: essentially, the file capabilities are ignored, as per the "Capabilities and execution of programs by root" section.
When you provide a set of capabilities to docker via the CLI flags, in a docker compose file, or in a k8s securityContext.capabilities block, this ultimately causes containerd to have runc execve a process with the requested capabilities (as the bounding, permitted, and effective sets) set before the execve. The final capability sets are then determined by the rules in the capabilities manual.
Capabilities when running as root
If you look in the capabilities(7) manual "Capabilities and execution of programs by root" section, you will find a description of what is happening in the accepted answer.
If the real or effective user ID of the process is 0 (root), then the file inheritable and permitted sets are ignored; instead they are notionally considered to be all ones (i.e., all capabilities enabled).
The end result is that the final capability set is just the one you provide.
For example given main.go:
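package main

import (
	"fmt"

	"kernel.org/pub/linux/libs/security/libcap/cap"
)

func main() {
	// Read the capability sets of the current process.
	c := cap.GetProc()
	// There is no format verb here, so fmt appends the value as %!(EXTRA ...).
	fmt.Printf("this process has these caps:", c)
}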
dockerfile:
FROM golang:1.18 as build
WORKDIR /go/src/app
COPY <<EOF ./go.mod
module "app"
go 1.18
EOF
RUN go get "kernel.org/pub/linux/libs/security/libcap/cap"
RUN go mod download
COPY main.go .
RUN CGO_ENABLED=0 go build -o /go/bin/app
FROM gcr.io/distroless/static-debian11
COPY --from=build --chmod=550 --chown=0:0 /go/bin/app /
CMD ["/app"]
and docker-compose.yml:
version: '3.4'
services:
  scratch:
    user: 0:0
    cap_drop:
      - ALL
    cap_add:
      - SYS_TIME
    build:
      context: .
      dockerfile: ./Dockerfile
Running scratch will print out this process has these caps:%!(EXTRA *cap.Set=cap_sys_time=ep).
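To make the root case concrete, here is a small sketch of the simplified rule that capabilities(7) gives for a process whose real and effective UIDs are 0: P'(permitted) = P(inheritable) | P(bounding) and P'(effective) = P'(permitted). The function name execAsRoot and the starting thread state are assumptions for illustration; I am assuming that cap_drop: ALL plus cap_add: SYS_TIME leaves a bounding set containing only CAP_SYS_TIME.

```go
package main

import "fmt"

// CAP_SYS_TIME is bit 25 in <linux/capability.h>.
const capSysTime = 25

// execAsRoot (hypothetical name) applies the simplified rule for a root
// process: the file inheritable and permitted sets count as all ones, so
// only the thread's inheritable and bounding sets matter.
func execAsRoot(pInheritable, pBounding uint64) (permitted, effective uint64) {
	permitted = pInheritable | pBounding
	effective = permitted
	return permitted, effective
}

func main() {
	// Assumed thread state for cap_drop: ALL plus cap_add: SYS_TIME.
	bounding := uint64(1) << capSysTime
	permitted, effective := execAsRoot(0, bounding)
	fmt.Printf("permitted=%#x effective=%#x\n", permitted, effective)
}
```

Both resulting sets contain only the CAP_SYS_TIME bit, which lines up with the cap_sys_time=ep reported by the test program.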
Capabilities when running as non-root
In this situation the file capabilities are NOT ignored. From the manual:
During an execve(2), the kernel calculates the new capabilities of the process using the following algorithm:
    P'(ambient)     = (file is privileged) ? 0 : P(ambient)
    P'(permitted)   = (P(inheritable) & F(inheritable)) |
                      (F(permitted) & P(bounding)) | P'(ambient)
    P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)
    P'(inheritable) = P(inheritable)    [i.e., unchanged]
    P'(bounding)    = P(bounding)       [i.e., unchanged]
where:
    P()   denotes the value of a thread capability set before the execve(2)
    P'()  denotes the value of a thread capability set after the execve(2)
    F()   denotes a file capability set
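The algorithm above is just bitmask arithmetic, so it can be sketched in a few lines of Go. The type and function names here (threadCaps, fileCaps, transformOnExec) are made up for illustration, and the starting thread state is an assumption: a non-root process whose runtime left only CAP_SYS_TIME in the inheritable and bounding sets, with nothing permitted, effective, or ambient.

```go
package main

import "fmt"

// threadCaps holds a thread's capability sets as bitmasks.
type threadCaps struct {
	Inheritable, Permitted, Effective, Bounding, Ambient uint64
}

// fileCaps holds the capabilities attached to an executable.
// The file effective "set" is really a single flag.
type fileCaps struct {
	Inheritable, Permitted uint64
	Effective              bool
}

// transformOnExec (hypothetical name) applies the quoted algorithm.
// privileged says whether the file has capabilities or set-user/group-ID bits.
func transformOnExec(p threadCaps, f fileCaps, privileged bool) threadCaps {
	next := threadCaps{Inheritable: p.Inheritable, Bounding: p.Bounding}
	if !privileged {
		next.Ambient = p.Ambient
	}
	next.Permitted = (p.Inheritable & f.Inheritable) | (f.Permitted & p.Bounding) | next.Ambient
	if f.Effective {
		next.Effective = next.Permitted
	} else {
		next.Effective = next.Ambient
	}
	return next
}

func main() {
	const sysTime = uint64(1) << 25 // CAP_SYS_TIME is bit 25

	// Assumed thread state before execve for a non-root container process.
	p := threadCaps{Inheritable: sysTime, Bounding: sysTime}

	plain := transformOnExec(p, fileCaps{}, false)
	withFileCaps := transformOnExec(p, fileCaps{Permitted: sysTime, Effective: true}, true)

	fmt.Printf("no file caps:   permitted=%#x effective=%#x\n", plain.Permitted, plain.Effective)
	fmt.Printf("with file caps: permitted=%#x effective=%#x\n", withFileCaps.Permitted, withFileCaps.Effective)
}
```

Under those assumptions, a binary without file capabilities ends up with empty permitted and effective sets, while a binary with cap_sys_time in its file permitted set and the effective flag on keeps CAP_SYS_TIME.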
So if we update our dockerfile and compose file like so: dockerfile:
FROM golang:1.18 as build
WORKDIR /go/src/app
COPY <<EOF ./go.mod
module "app"
go 1.18
EOF
RUN go get "kernel.org/pub/linux/libs/security/libcap/cap"
RUN go mod download
COPY main.go .
RUN CGO_ENABLED=0 go build -o /go/bin/app
FROM gcr.io/distroless/static-debian11:nonroot
COPY --from=build --chmod=550 --chown=65534:65534 /go/bin/app /
CMD ["/app"]
and docker-compose.yml:
version: '3.4'
services:
  scratch:
    user: 65534:65534
    cap_drop:
      - ALL
    cap_add:
      - SYS_TIME
    build:
      context: .
      dockerfile: ./Dockerfile
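Running scratch will print out this process has these caps:%!(EXTRA *cap.Set==)
i.e. none! The binary carries no file capabilities, so F(permitted) and F(inheritable) are empty and, with no ambient capabilities set, the formula above yields empty permitted and effective sets. So you cannot use capabilities in the docker compose file alone when not running as root. (Note: 65534 is the "nobody" user and group.)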
However if we set the capabilities on the file in our dockerfile:
FROM golang:1.18 as build
RUN apt-get update
RUN apt-get install -y libcap2-bin
WORKDIR /go/src/app
COPY <<EOF ./go.mod
module "app"
go 1.18
EOF
RUN go get "kernel.org/pub/linux/libs/security/libcap/cap"
RUN go mod download
COPY main.go .
RUN CGO_ENABLED=0 go build -o /go/bin/app
RUN setcap 'cap_sys_time=ep' /go/bin/app
FROM gcr.io/distroless/static-debian11:nonroot
COPY --from=build --chmod=550 --chown=65534:65534 /go/bin/app /
CMD ["/app"]
With no further changes to the docker compose file, the output is now this process has these caps:%!(EXTRA *cap.Set=cap_sys_time=ep). So the process is not running as root, but it does have the extra capabilities you were after. This is a real pain in the neck, however, if you want to add capabilities to something running in an interpreter (e.g. python or bash), because the interpreter binary is the thing that needs the capabilities added.
Upvotes: 3
Reputation: 1701
This is the expected behavior. Capabilities are meant to divide the privileges traditionally associated with the superuser (root) into distinct units; a non-root user cannot enable or disable such capabilities, since that could create a security breach.
The capabilities feature in the SecurityContext key is designed to manage (either limit or expand) the Linux capabilities for the container's context. In a pod run as root, the capabilities are inherited by the processes, since these are owned by the root user. However, if the pod is run as a non-root user, it does not matter whether the context has those capabilities enabled, because the Linux kernel will not allow a non-root user to set capabilities on a process.
This point can be illustrated very easily. If you run your container with the key runAsNonRoot set to true and add the capabilities as you did in the manifest shared, and then exec into the Pod, you should be able to see those capabilities added to the context with the command:
$ capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_time,cap_mknod,cap_audit_write,cap_setfcap+i
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_time,cap_mknod,cap_audit_write,cap_setfcap
But you will see CapPrm and CapEff set to 0x0 in any process run by the user 1001:
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1001 1 0.0 0.0 4340 760 ? Ss 14:57 0:00 /bin/sh -c node server.js
1001 7 0.0 0.5 772128 22376 ? Sl 14:57 0:00 node server.js
1001 21 0.0 0.0 4340 720 pts/0 Ss 14:59 0:00 sh
1001 28 0.0 0.0 17504 2096 pts/0 R+ 15:02 0:00 ps aux
$ grep Cap /proc/1/status
CapInh: 00000000aa0425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000aa0425fb
CapAmb: 0000000000000000
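If the hex masks are hard to read, capsh --decode=<mask> will translate them. As an alternative, here is a rough Go sketch that does the same by hand; the helper name decodeCaps is made up for illustration, and the name table only covers bits 0 through 31 as numbered in <linux/capability.h> (higher bits are printed by number).

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames lists capability names indexed by bit number (bits 0 to 31).
var capNames = []string{
	"cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
	"cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
	"cap_setpcap", "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
	"cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
	"cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
	"cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
	"cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
	"cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
}

// decodeCaps (hypothetical helper) turns a hex mask from /proc/<pid>/status
// into capability names, lowest bit first.
func decodeCaps(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, err
	}
	var names []string
	for bit := 0; bit < 64; bit++ {
		if mask&(uint64(1)<<bit) == 0 {
			continue
		}
		if bit < len(capNames) {
			names = append(names, capNames[bit])
		} else {
			names = append(names, fmt.Sprintf("cap_%d", bit))
		}
	}
	return names, nil
}

func main() {
	// The CapInh/CapBnd value from the output above.
	names, err := decodeCaps("00000000aa0425fb")
	if err != nil {
		panic(err)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Decoding 00000000aa0425fb this way gives the same 15 capabilities that capsh --print lists in the bounding set, including cap_sys_time, while the all-zero CapPrm and CapEff masks decode to nothing.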
Upvotes: 4