Injecting a mount into a disjoint mount namespace behind a private mount propagation?

Question

As part of some work I'm doing on container diagnostics tooling for Linux container systems like docker and containerd/runc, I've been looking for a way to inject or bind a mount from one mount namespace into another disjoint mount namespace.

Problem statement

Consider the following scenario:

hostdir                                   nsdir
-------                                   -----
/                                         /         [mountns 1, pidns 1]
  /var/containers/container1-root         /         [mountns 2, pidns 2, propagation=private]
    [not visible]                         /c1volume [mountns 2, pidns 2]
  /var/containers/container2-root         /         [mountns 3, pidns 1, propagation=private, privileged]

container1 is a regular container with a volume mounted on /c1volume. Due to mount propagation rules, the host cannot see /c1volume, because it was mounted after the new mount namespace was entered.

container2 is run with the pid namespace of the host, so it can "see" out of the container to interact with the host. It's privileged, and can use nsenter to container-break into the host mount namespace too.
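To confirm which namespaces two processes actually share, the namespace links under /proc can be compared. This is only a minimal sketch, assuming a Linux /proc and enough privilege to read the other process's ns links; same_ns is a hypothetical helper name, not part of any existing tool.

```shell
# same_ns TYPE PID1 PID2
# Succeeds (exit 0) if both processes are in the same namespace of the
# given type (mnt, pid, net, ...). Exits 2 if either link cannot be read.
same_ns() {
    ns_type=$1
    a=$(readlink "/proc/$2/ns/$ns_type") || return 2
    b=$(readlink "/proc/$3/ns/$ns_type") || return 2
    # The link targets look like "mnt:[4026531840]"; equal text means same namespace.
    [ "$a" = "$b" ]
}
```

In the scenario above, run as root on the host, `same_ns pid 1 <container2-pid>` would be expected to succeed and `same_ns mnt 1 <container2-pid>` to fail, where `<container2-pid>` stands for any process inside container2.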

The goal is to make the filesystem at /var/containers/container2-root visible to processes running in container1's mount namespace (mount namespace 2), so that processes in container1 can access additional injected tools or utilities not usually included in their container image, while still seeing the pid numbers of pidns 2 (container1).

I haven't been able to figure out a way to do this.

Mount propagation rules mean that bind-mounting from the host's mount namespace does not make the bind mount visible to processes in container1's mount namespace:

mkdir /var/containers/container1-root/container2
mount -o bind /var/containers/container2-root /var/containers/container1-root/container2

Changing the mount propagation of /var/containers/container1-root appears to have no effect on this.
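For checking what propagation type each mount really has, the optional fields of a mountinfo file carry tags such as shared:N, master:N or unbindable, and a mount with no tag is private. A small sketch for listing them; mount_propagation is a hypothetical helper name, and the awk field positions assume the standard /proc/<pid>/mountinfo layout.

```shell
# mount_propagation [FILE]
# Prints "<mount point> <propagation tags>" for every line of a mountinfo
# file (standard input if no file is given). Optional fields start at
# field 7 and end at the "-" separator; none present means private.
mount_propagation() {
    awk '{
        prop = ""
        for (i = 7; i <= NF && $i != "-"; i++)
            prop = (prop == "" ? $i : prop "," $i)
        if (prop == "") prop = "private"
        print $5, prop
    }' "$@"
}
```

Typical use would be `mount_propagation /proc/self/mountinfo` on the host, or `mount_propagation /proc/<pid>/mountinfo` to see the mount table as a process inside container1 sees it. `findmnt -o TARGET,PROPAGATION` reports the same information for the current namespace.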

I could create a new mount and process namespace that can see /var/containers/container1-root as / and has a bind mount visible for /var/containers/container2-root, but it won't see any of the processes in the original container1 pid namespace, and it won't see the mount of /c1volume.

I've tried a great many variations of tricks with pivot_root, unshare, nsenter, mount -o bind, etc., so far to no avail.

The co-operation of the leader process (pid 1) of container1 is not available; this is an external injection from the container tooling layer.

Demo setup

Here's a setup recipe to create a demo environment with handmade containerization using low-level Linux primitives so you can see what's going on.

# create "container images" (static)
mkdir images
cd images
mkdir -p container1-root/{bin,proc,sys,dev,etc} 
curl -sSLf -o container1-root/bin/busybox https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
chmod +x container1-root/bin/busybox
for cmd in ls mount sh ; do ln -s busybox container1-root/bin/$cmd; done
cat > container1-root/enter <<'__END__'
#!/bin/sh
mount -t sysfs none /sys
exec /bin/busybox sh -i
__END__
chmod +x container1-root/enter
cp -aR container1-root container2-root
touch container1-root/container1
touch container2-root/container2
mkdir container1-root/c1volume
cd ..

# Create a volume for c1
mkdir -p volumes/c1volume
touch volumes/c1volume/i-see-c1volume

# create the container runtime dirs
for c in container1-root container2-root; do
mkdir -p {containers,workdirs,scratch}/$c
mount -t overlay overlay -o lowerdir=$PWD/images/$c,upperdir=$PWD/scratch/$c,workdir=$PWD/workdirs/$c $PWD/containers/$c
mount --make-rprivate $PWD/containers/$c
done

# [Terminal session 1: container1]
# Launch container1, with mounted volume not visible to the host and new pid namespace.
unshare -m 
mount -o bind volumes/c1volume containers/container1-root/c1volume
ls containers/container1-root/c1volume/
unshare -p -m --mount-proc --fork --propagation private --wd=containers/container1-root --root=containers/container1-root /enter
PS1='container1 # '
ls /c1volume
echo $$

# [Terminal session 2: container2]
# This container shares the host pid namespace, but not mount namespace, and does not
# have a mounted volume.
unshare -m
unshare -m --mount-proc --fork --propagation private --wd=containers/container2-root --root=containers/container2-root /enter
PS1='container2 # '

Demo

Now, from the host, you will see

host # findmnt | egrep 'c1volume|container[12]'
├─/root/containers/container1-root                  overlay                                        overlay         rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root
└─/root/containers/container2-root                  overlay                                        overlay         rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root

no c1volume is visible, and

host # ls /root/containers/container1-root/c1volume/
host #

its bind-mounted contents are not visible.

A process in container2 can break out to the host and then nsenter into container1:

container2 # /bin/busybox nsenter -t 1 -m -p /bin/bash -w /root
host # nsenter -t "$(lsof -t containers/container1-root)" --all -w -r /bin/sh
# ls /c1volume
i-see-c1volume

but has no way to access container2-root from there.

It's possible to mount -o bind into /proc/$(lsof -t containers/container1-root)/root/, but mount propagation means this won't be seen from the existing processes in container1-root. And if nsenter or unshare are used to first enter the mount namespace for container1, the container2-root file system is no longer visible so it cannot be bind-mounted.
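For reference, the first attempt described above might be scripted roughly as follows. This is a sketch only: try_proc_root_bind and the DRY_RUN switch are hypothetical names introduced here, the container2 target directory matches the earlier bind-mount example, and the commands need root on the host.

```shell
# try_proc_root_bind C1_PID SRC
# Bind-mounts SRC onto <container1 root>/container2, reaching container1's
# root through /proc/C1_PID/root from the caller's own mount namespace.
# With DRY_RUN set, the commands are printed instead of executed.
try_proc_root_bind() {
    c1_pid=$1   # any process inside container1, e.g. from lsof -t
    src=$2      # e.g. $PWD/containers/container2-root
    ${DRY_RUN:+echo} mkdir -p "/proc/$c1_pid/root/container2"
    ${DRY_RUN:+echo} mount -o bind "$src" "/proc/$c1_pid/root/container2"
    # Result observed in the demo: processes already running in
    # container1 do not see the new mount.
}
```

With the demo layout this would be invoked as `try_proc_root_bind "$(lsof -t containers/container1-root)" "$PWD/containers/container2-root"`; prefixing the call with `DRY_RUN=1` only prints the two commands.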
