Craig Ringer

Reputation: 324841

Injecting a mount into a disjoint mount namespace behind a private mount propagation?

As part of some work I'm doing on container diagnostics tooling for Linux container systems like docker and containerd/runc, I've been looking for a way to inject or bind a mount from one mount namespace into another disjoint mount namespace.

Problem statement

Consider the following scenario:

hostdir                                   nsdir
-------                                   -----
/                                         /         [mountns 1, pidns 1]
  /var/containers/container1-root         /         [mountns 2, pidns 2, propagation=private]
    [not visible]                         /c1volume [mountns 2, pidns 2]
  /var/containers/container2-root         /         [mountns 3, pidns 1, propagation=private, privileged]

container1 is a regular container. It has a volume mounted on c1volume. Due to mount propagation rules, the host cannot see c1volume, as it's mounted after the new mount namespace is entered.
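A quick way to confirm that two processes really are in different mount namespaces is to compare their /proc/PID/ns/mnt links. This is only a minimal sketch; the helper names mntns_of and same_mntns are made up for illustration, and reading the link for another user's process generally needs root.

```shell
# Print the mount namespace identifier of a process, e.g. "mnt:[4026531840]".
# Hypothetical helper; reading another user's /proc/PID/ns/mnt usually needs root.
mntns_of() {
    readlink "/proc/$1/ns/mnt"
}

# Succeed (exit 0) when both PIDs are in the same mount namespace,
# return 1 when they differ, and 2 when either link cannot be read.
same_mntns() {
    a="$(mntns_of "$1")" || return 2
    b="$(mntns_of "$2")" || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, same_mntns $$ "$(lsof -t containers/container1-root)" || echo "different mount namespaces" run from a host shell should print the message once container1 is running.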

container2 is run with the pid namespace of the host, so it can "see" out of the container to interact with the host. It's privileged, and can use nsenter to container-break into the host mount namespace too.

The goal is to make the filesystem at /var/containers/container2-root visible to processes running in container1's mount namespace (mount namespace 2), so that processes in container1 can use additional injected tools or utilities not usually included in their container image while still seeing the pid numbers of pidns 2 (container1).

I haven't been able to figure out a way to do this.

Mount propagation rules mean that bind-mounting from the host's mount namespace does not make the bind mount visible to processes in container1's mount namespace:

mkdir /var/containers/container1-root/container2
mount -o bind /var/containers/container2-root /var/containers/container1-root/container2

Changing the mount propagation of /var/containers/container1-root appears to have no effect on this.
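To check what propagation a mount actually has, rather than what I think I set, the optional fields in /proc/PID/mountinfo can be read directly. Here is a rough sketch; mount_propagation is a made-up helper name, and it only distinguishes the shared, slave (master:N), unbindable and private cases.

```shell
# Read /proc/PID/mountinfo lines on stdin; print "<mountpoint> <propagation>".
# Field 5 is the mount point; optional fields start at field 7 and run up to
# the "-" separator. No shared:/master:/unbindable tag means private.
mount_propagation() {
    awk '{
        prop = ""
        for (i = 7; i <= NF && $i != "-"; i++) {
            tag = ""
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            if (tag != "") prop = prop (prop == "" ? "" : ",") tag
        }
        if (prop == "") prop = "private"
        print $5, prop
    }'
}
```

Something like mount_propagation < /proc/self/mountinfo | grep container lists the container mounts with their propagation, which should roughly match what findmnt -o +PROPAGATION reports.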

I could create a new mount and process namespace that can see /var/containers/container1-root as / and has a bind mount visible for /var/containers/container2-root, but it won't see any of the processes in the original container1 pid namespace, and it won't see the mount of /c1volume.

I've tried a great many variations of tricks with pivot_root, unshare, nsenter, mount -o bind etc, as yet to no avail.

The co-operation of the leader process (pid 1) of container1 is not available; this is an external injection from the container tooling layer.

Demo setup

Here's a setup recipe to create a demo environment with handmade containerization using low-level Linux primitives so you can see what's going on.

# create "container images" (static)
mkdir images
cd images
mkdir -p container1-root/{bin,proc,sys,dev,etc} 
curl -sSLf -o container1-root/bin/busybox https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
chmod +x container1-root/bin/busybox
for cmd in ls mount sh ; do ln -s busybox container1-root/bin/$cmd; done
cat > container1-root/enter <<'__END__'
#!/bin/sh
mount -t sysfs none /sys
exec /bin/busybox sh -i
__END__
chmod +x container1-root/enter
cp -aR container1-root container2-root
touch container1-root/container1
touch container2-root/container2
mkdir container1-root/c1volume
cd ..

# Create a volume for c1
mkdir -p volumes/c1volume
touch volumes/c1volume/i-see-c1volume

# create the container runtime dirs
for c in container1-root container2-root; do
mkdir -p {containers,workdirs,scratch}/$c
mount -t overlay overlay -o lowerdir=$PWD/images/$c,upperdir=$PWD/scratch/$c,workdir=$PWD/workdirs/$c $PWD/containers/$c
mount --make-rprivate $PWD/containers/$c
done

# [Terminal session 1: container1]
# Launch container1, with mounted volume not visible to the host and new pid namespace.
unshare -m 
mount -o bind volumes/c1volume containers/container1-root/c1volume
ls containers/container1-root/c1volume/
unshare -p -m --mount-proc --fork --propagation private --wd=containers/container1-root --root=containers/container1-root /enter
PS1='container1 # '
ls /c1volume
echo $$

# [Terminal session 2: container2]
# This container shares the host pid namespace, but not mount namespace, and does not
# have a mounted volume.
unshare -m
unshare -m --mount-proc --fork --propagation private --wd=containers/container2-root --root=containers/container2-root /enter
PS1='container2 # '

Demo

Now, from the host, you will see

host # findmnt | egrep 'c1volume|container[12]'
├─/root/containers/container1-root                  overlay                                        overlay         rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root
└─/root/containers/container2-root                  overlay                                        overlay         rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root

no c1volume is visible, and

host # ls /root/containers/container1-root/c1volume/
host # 

its bind-mounted contents are not visible.

A process in container2 can container-break into the host and then nsenter container1:

container2 # /bin/busybox nsenter -t 1 -m -p /bin/bash -w /root
host # nsenter -t "$(lsof -t containers/container1-root)" --all -w -r /bin/sh
# ls /c1volume
i-see-c1volume

but has no way to access container2-root from there.

It's possible to mount -o bind into /proc/$(lsof -t containers/container1-root)/root/, but mount propagation means this won't be seen from the existing processes in container1-root. And if nsenter or unshare are used to first enter the mount namespace for container1, the container2-root file system is no longer visible so it cannot be bind-mounted.

Upvotes: 4

Views: 781

Answers (1)

Craig Ringer

Reputation: 324841

So of course I worked it out right after finally writing this up. At least it works in my demo environment; I still have to compare against a real containerd setup to see whether it holds there.

The trick is that nsenter without any --root or --wd will remain in the host rootdir and workdir, but enter the guest mount namespace. It is not necessary to enter the guest (container1) pid namespace as well.

host # c1leader="$(lsof -t containers/container1-root)"
host # nsenter -t $c1leader -m
host # findmnt -o +PROPAGATION | egrep 'container[12]|c1volume'
├─/root/containers/container1-root                  overlay                                           overlay         rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root private
│ ├─/root/containers/container1-root/c1volume       /dev/mapper/vgubuntu-root[/root/volumes/c1volume] ext4            rw,relatime,errors=remount-ro                                                                                                   private
│ ├─/root/containers/container1-root/proc           proc                                              proc            rw,nosuid,nodev,noexec,relatime                                                                                                 private
│ │ └─/root/containers/container1-root/proc         none                                              proc            rw,relatime                                                                                                                     private
│ └─/root/containers/container1-root/sys            none                                              sysfs           rw,relatime                                                                                                                     private
└─/root/containers/container2-root                  overlay                                           overlay         rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root private
host # mkdir /root/containers/container1-root/container2-root
host # mount -o bind,ro /root/containers/container2-root /root/containers/container1-root/container2-root

Now, in container1's session:

container1 # ls /
bin              c1volume         container1       container2-root  dev              enter            etc              foo              proc             sys
container1 # ls /c1volume/
i-see-c1volume
container1 # ls container2-root/
bin         container2  dev         enter       etc         proc        sys
container1 # busybox ps
PID   USER     TIME  COMMAND
    1 0         0:00 /bin/busybox sh -i
   24 0         0:00 busybox ps
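The manual steps above could be wrapped up roughly like this. It is a sketch based only on the demo environment, not a tested tool: inject_bind_mount and the DRY_RUN switch are names I made up, and it assumes the host paths still resolve after entering the target mount namespace, which is true in the demo but unverified for a real containerd/runc container.

```shell
# Bind-mount host path SRC read-only at host path DST inside the mount
# namespace of process LEADER_PID (hypothetical helper).
# Set DRY_RUN=1 to print the commands instead of running them.
# Running for real needs root (CAP_SYS_ADMIN).
inject_bind_mount() {
    leader_pid="$1"
    src="$2"
    dst="$3"
    run=""
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run="echo"
    fi
    # Enter only the mount namespace (-m), with no --root or --wd, as in the
    # answer above, so SRC and DST are looked up as host paths in the demo.
    $run nsenter -t "$leader_pid" -m -- mkdir -p "$dst" &&
    $run nsenter -t "$leader_pid" -m -- mount -o bind,ro "$src" "$dst"
}
```

With the demo layout that would be inject_bind_mount "$c1leader" /root/containers/container2-root /root/containers/container1-root/container2-root; running it with DRY_RUN=1 first shows the two nsenter commands without touching anything.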

Upvotes: 3
