Reputation: 324841
As part of some work I'm doing on container diagnostics tooling for Linux container systems like docker and containerd/runc, I've been looking for a way to inject or bind a mount from one mount namespace into another disjoint mount namespace.
Consider the following scenario:
hostdir                           nsdir      namespaces
-------                           -----      ----------
/                                 /          [mountns 1, pidns 1]
/var/containers/container1-root   /          [mountns 2, pidns 2, propagation=private]
[not visible]                     /c1volume  [mountns 2, pidns 2]
/var/containers/container2-root   /          [mountns 3, pidns 1, propagation=private, privileged]
container1 is a regular container. It has a volume mounted on c1volume. Due to mount propagation rules, the host cannot see c1volume, as it's mounted after the new mount namespace is entered.
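Whether two processes really sit in different mount namespaces can be checked by comparing their /proc/<pid>/ns/mnt links. A minimal sketch, assuming a Linux /proc and enough privilege to read the other process's namespace links; same_mntns is a hypothetical helper name, not part of any existing tool:

```shell
# Hypothetical helper: succeed if two pids share one mount namespace.
# Each /proc/<pid>/ns/mnt is a symlink such as "mnt:[4026531840]";
# identical link targets mean the same namespace.
# Returns 0 if equal, 1 if different, 2 if a link cannot be read.
same_mntns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}
```

Run as root on the host, same_mntns 1 <pid of container1's leader> would be expected to return 1 in the scenario above, since container1 lives in its own mount namespace.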
container2 is run with the pid namespace of the host, so it can "see" out of the container to interact with the host. It's privileged, and can use nsenter to break out into the host mount namespace too.
The goal is to make the filesystem at /var/containers/container2-root visible to processes running in container1's mount namespace (mount namespace 2), e.g. so that processes in container1 can access additional injected tools or utilities not usually included in their container image, while still seeing the pid numbers for pidns 2 (container1).
I haven't been able to figure out a way to do this.
Mount propagation rules mean that bind-mounting from the host's mount namespace does not make the bind mount visible to processes in container1's mount namespace:
mkdir /var/containers/container1-root/container2
mount -o bind /var/containers/container2-root /var/containers/container1-root/container2
Changing the mount propagation of /var/containers/container1-root appears to have no effect on this.
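To see which propagation type a mount actually has inside a given namespace, the optional fields of /proc/<pid>/mountinfo can be read directly (they sit between the mount options and the "-" separator; no field at all means private). A rough sketch, where mount_propagation is a hypothetical helper name and the mountinfo text is supplied on stdin:

```shell
# Hypothetical helper: print the propagation tags of one mount point.
#   $1    = mount point to look up (field 5 of mountinfo)
#   stdin = contents of a mountinfo file
# Prints e.g. "shared:12", "master:3" or "private"; prints nothing if
# the mount point is not listed. If several mounts are stacked on the
# same mount point, the last listed one wins.
mount_propagation() {
    awk -v mp="$1" '
        $5 == mp {
            prop = ""
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++)
                prop = prop (prop == "" ? "" : ",") $i
            result = (prop == "" ? "private" : prop)
        }
        END { if (result != "") print result }
    '
}
```

For example, mount_propagation /var/containers/container1-root < /proc/self/mountinfo shows the host's view, and substituting the pid of a process in container1 for "self" shows the view from inside that namespace.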
I could create a new mount and process namespace that sees /var/containers/container1-root as / and has a bind mount visible for /var/containers/container2-root, but it won't see any of the processes in the original container1 pid namespace, and it won't see the mount of /c1volume.
I've tried a great many variations of tricks with pivot_root, unshare, nsenter, mount -o bind etc., as yet to no avail.
The co-operation of the leader process (pid 1) of container1 is not available; this is an external injection from the container tooling layer.
Here's a setup recipe to create a demo environment with handmade containerization using low-level Linux primitives so you can see what's going on.
# create "container images" (static)
mkdir images
cd images
mkdir -p container1-root/{bin,proc,sys,dev,etc}
curl -sSLf -o container1-root/bin/busybox https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
chmod +x container1-root/bin/busybox
for cmd in ls mount sh ; do ln -s busybox container1-root/bin/$cmd; done
cat > container1-root/enter <<'__END__'
#!/bin/sh
mount -t sysfs none /sys
exec /bin/busybox sh -i
__END__
chmod +x container1-root/enter
cp -aR container1-root container2-root
touch container1-root/container1
touch container2-root/container2
mkdir container1-root/c1volume
cd ..
# Create a volume for c1
mkdir -p volumes/c1volume
touch volumes/c1volume/i-see-c1volume
# create the container runtime dirs
for c in container1-root container2-root; do
mkdir -p {containers,workdirs,scratch}/$c
mount -t overlay overlay -o lowerdir=$PWD/images/$c,upperdir=$PWD/scratch/$c,workdir=$PWD/workdirs/$c $PWD/containers/$c
mount --make-rprivate $PWD/containers/$c
done
# [Terminal session 1: container1]
# Launch container1, with mounted volume not visible to the host and new pid namespace.
unshare -m
mount -o bind volumes/c1volume containers/container1-root/c1volume
ls containers/container1-root/c1volume/
unshare -p -m --mount-proc --fork --propagation private --wd=containers/container1-root --root=containers/container1-root /enter
PS1='container1 # '
ls /c1volume
echo $$
# [Terminal session 2: container2]
# This container shares the host pid namespace, but not mount namespace, and does not
# have a mounted volume.
unshare -m
unshare -m --mount-proc --fork --propagation private --wd=containers/container2-root --root=containers/container2-root /enter
PS1='container2 # '
Now, from the host, you will see
host # findmnt | egrep 'c1volume|container[12]'
├─/root/containers/container1-root overlay overlay rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root
└─/root/containers/container2-root overlay overlay rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root
no c1volume is visible, and
host # ls /root/containers/container1-root/c1volume/
host #
its bind-mounted contents are not visible.
A process in container2 can break out to the host and then nsenter container1:
container2 # /bin/busybox nsenter -t 1 -m -p /bin/bash -w /root
host # nsenter -t "$(lsof -t containers/container1-root)" --all -w -r /bin/sh
# ls /c1volume
i-see-c1volume
but has no way to access container2-root from there.
It's possible to mount -o bind into /proc/$(lsof -t containers/container1-root)/root/, but mount propagation means this won't be seen by the existing processes in container1-root. And if nsenter or unshare is used to first enter the mount namespace for container1, the container2-root file system is no longer visible, so it cannot be bind-mounted.
Upvotes: 4
Views: 781
Reputation: 324841
So of course I worked it out right after finally writing this up. It works in my demo environment, at least; I still need to compare against a real containerd setup to see whether it holds there.
The trick is that nsenter without any --root or --wd will remain in the host rootdir and workdir, but enter the guest mount namespace. It is not necessary to enter the guest (container1) pid namespace as well.
host # c1leader="$(lsof -t containers/container1-root)"
host # nsenter -t $c1leader -m
host # findmnt -o +PROPAGATION | egrep 'container[12]|c1volume'
├─/root/containers/container1-root overlay overlay rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root private
│ ├─/root/containers/container1-root/c1volume /dev/mapper/vgubuntu-root[/root/volumes/c1volume] ext4 rw,relatime,errors=remount-ro private
│ ├─/root/containers/container1-root/proc proc proc rw,nosuid,nodev,noexec,relatime private
│ │ └─/root/containers/container1-root/proc none proc rw,relatime private
│ └─/root/containers/container1-root/sys none sysfs rw,relatime private
└─/root/containers/container2-root overlay overlay rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root private
host # mkdir /root/containers/container1-root/container2-root
host # mount -o bind,ro /root/containers/container2-root /root/containers/container1-root/container2-root
Now, in container1's session:
container1 # ls /
bin c1volume container1 container2-root dev enter etc foo proc sys
container1 # ls /c1volume/
i-see-c1volume
container1 # ls container2-root/
bin container2 dev enter etc proc sys
container1 # busybox ps
PID USER TIME COMMAND
1 0 0:00 /bin/busybox sh -i
24 0 0:00 busybox ps
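The steps above can be wrapped up as a small helper. This is a sketch only, checked against nothing beyond the demo environment: inject_bind_mount and the DRY_RUN switch are hypothetical names, it has to run as root on the host, and it assumes the host paths are still reachable inside the target mount namespace (as the findmnt output above shows for this demo). A real container runtime may behave differently.

```shell
# Hypothetical helper: bind-mount a directory read-only inside the mount
# namespace of a running process, without joining its pid namespace.
#   $1 = pid of a process in the target mount namespace
#   $2 = source directory (path as seen inside that namespace)
#   $3 = destination directory (path as seen inside that namespace)
# Set DRY_RUN=1 to print the commands instead of executing them.
inject_bind_mount() {
    run=${DRY_RUN:+echo}
    # Only -m is given: no --root/--wd and no -p, as in the steps above.
    $run nsenter -t "$1" -m -- mkdir -p "$3" &&
    $run nsenter -t "$1" -m -- mount -o bind,ro "$2" "$3"
}
```

With DRY_RUN=1 set, inject_bind_mount "$c1leader" /root/containers/container2-root /root/containers/container1-root/container2-root only prints the two nsenter commands, which is a cheap way to review them before running anything as root.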
Upvotes: 3