Introduction
The inspiration for the following
research was a CTF task called namespaces
by _tsuro
from the 35C3
CTF. While solving this challenge we found out that
creating namespace-based sandboxes which can then be joined by
external processes is a pretty challenging task from a security
standpoint. On our way back home from the CTF we found out that
Docker, with its “docker exec” functionality (which is actually
implemented by runc from opencontainers) follows a similar model and
decided to challenge this implementation.
Goal and results
Our goal was to compromise the host
environment from inside a Docker container running in the default or
hardened configuration (e.g. limited capabilities and syscall
availability). We considered the two following attack vectors:
- a malicious Docker image,
- a malicious process inside a container (e.g. a compromised Dockerized service running as root).
Results: we have achieved full code
execution on the host, with all capabilities (i.e. on the
administrative ‘root’ access level), triggered by either:
- running “docker exec” from the host, on a compromised Docker container,
- starting a malicious Docker image.
This vulnerability was assigned
CVE-2019-5736 and was officially announced here.
Default Docker security settings
Despite Docker not being marketed as
sandboxing software, its default setup is meant to secure host
resources from being accessed by processes inside of a container.
Although the initial process inside a Docker container is running as
root, it has very limited privileges, which is achieved using several
mechanisms (this
paper describes it thoroughly):
Linux capabilities
Docker containers have a very limited
set of capabilities by default, which makes a container root user de
facto an unprivileged user.
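To see which capabilities a process actually holds, the CapEff line of /proc/self/status can be decoded. The sketch below is only an illustration: the helper names are ours, the name table covers only the first 32 capability bits, and the mask used in the usage note is the value commonly reported for a default Docker container at the time, which we treat as an assumption rather than a guarantee.

```python
# Names of capability bits 0..31, in kernel numbering order.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]


def decode_caps(hex_mask):
    """Turn a hex capability mask (as printed in /proc/<pid>/status) into names.

    Bits above 31 are ignored because the table above stops there.
    """
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def effective_caps(status_path="/proc/self/status"):
    """Read and decode the CapEff line of a status file (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []
```

For example, decode_caps("00000000a80425fb") lists 14 capabilities and includes neither CAP_SYS_ADMIN nor CAP_SYS_PTRACE, which is why several of the ideas described later were blocked.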
seccomp
This mechanism blocks a container’s
processes from executing a subset of syscalls or filters their
arguments (thus limiting their impact on the host environment).
namespaces
This mechanism limits
containerized processes’ access to the host filesystem and
the visibility of processes across the host/container
boundary.
cgroups
The control groups (cgroups) mechanism
makes it possible to limit and manage various types of resources (RAM, CPU, ...)
used by a group of processes.
It’s possible to disable all of these
mechanisms (for example by using the --privileged command-line
option) or to specify any set of syscalls/capabilities/shared
namespaces explicitly. Disabling those hardening mechanisms makes it
possible to easily escape the container. Instead, we will be looking
at Docker containers running the default security configuration.
Failed approaches
Before we ended up finding the final
vulnerability we had tried many other ideas, most of which were
mitigated by limited capabilities or by seccomp filters.
As the whole research was a follow-up
to a 35C3
CTF task, we started by investigating what happens
when a new process gets started in an existing namespace (a.k.a.
“docker exec”). The goal here was to check if we can access some
host resources by obtaining them from the newly joined process.
Specifically, we looked for ways to access that process from inside
the container before it joins all used namespaces. Imagine the
following scenario, where a process:
- joins the user and PID namespaces,
- forks (to actually join the PID namespace),
- joins the rest of the namespaces (mount, net etc.).
If we could ptrace that process as soon
as it is visible to us (i.e. right after it joined the PID namespace), we
could prevent it from joining the rest of the namespaces, which would
in turn enable e.g. host filesystem access.
The lack of the capabilities required to
ptrace could be bypassed by having the container init process unshare
the user namespace (this yields the full set of
capabilities in the new user namespace). Then “docker exec” would
join our new namespace (obtained via “/proc/pid/ns/”), inside of
which we can ptrace (but seccomp limitations would still apply).
It turns out that runc joins all of the
required namespaces and only forks after having done so, which
prevents this attack vector. Additionally, the default Docker
configuration also disables all namespace related syscalls within the
container (setns, unshare etc.).
Next we focused solely on the proc
filesystem (more info: proc(5))
as it’s quite special and can often cross namespace boundaries. The
most interesting entries are:
- /proc/pid/mem - This doesn’t give us much by itself, as the target process needs to already be in the same PID namespace as the malicious one. The same applies to ptrace(2).
- /proc/pid/cwd, /proc/pid/root - Before a process fully joins a container (after it joins namespaces but before it updates its root (chroot) and cwd (chdir)) these point to the host filesystem, which could possibly allow us to access it. However, since the runc process is not dumpable (read more: http://man7.org/linux/man-pages/man2/ptrace.2.html), we cannot use those.
- /proc/pid/exe - Not of any use just by itself (same reason as cwd and root), but we have found a way around that and used it in the final exploit (described below).
- /proc/pid/fd/ - Some file descriptors may be leaked from ancestor namespaces (especially the mount namespace), or we could disturb the parent - child (actually grandchild) communication in runc. We found nothing of particular interest here, as synchronisation was done with local sockets (which we can’t reuse).
- /proc/pid/map_files/ - A very interesting vector: before runc executes the target binary (but after the process is visible to us, i.e. it joined the PID namespace) all the entries refer to binaries from the host filesystem (since that is where the process was originally spawned). Unfortunately, we discovered that we cannot follow these links without the SYS_ADMIN capability (source), even from within the same process.
Side note:
When executing the following command:
/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /bin/ls -al /proc/self/exe
“/proc/self/exe” points to
“ld-linux-x86-64.so.2” (not “/bin/ls”, as one might think).
The attack idea was to force “docker
exec” to use the dynamic loader from the host to execute a binary inside
the container, by replacing the original exec target (e.g. “/bin/bash”)
with a text file whose first line is:
#!/proc/self/map_files/address-in-memory-of-ld.so /evil_binary
Then /evil_binary could overwrite
/proc/self/exe and thus overwrite the host ld.so. This approach was
unsuccessful due to the aforementioned SYS_ADMIN capability
requirement.
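The “address-in-memory-of-ld.so” part of that path is the name of a map_files entry, which is the start and end address of a mapping written in hexadecimal without leading zeros. As a rough sketch (the function name and the sample addresses are made up for illustration), these names can be derived from the text of /proc/pid/maps:

```python
def map_files_names(maps_text, needle):
    """Return map_files entry names for mappings whose path contains needle.

    maps_text is the content of a /proc/<pid>/maps file. Each line looks like
    "start-end perms offset dev inode path"; anonymous mappings have no path.
    """
    names = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or needle not in fields[5]:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        # maps pads addresses with zeros, map_files entry names do not.
        names.append(f"{start:x}-{end:x}")
    return names


SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
7f2c5e3d1000-7f2c5e3f4000 r-xp 00000000 08:02 135522 /lib/x86_64-linux-gnu/ld-2.27.so
7f2c5e5f0000-7f2c5e5f4000 rw-p 00000000 00:00 0
7ffd1b5f0000-7ffd1b611000 rw-p 00000000 00:00 0 [stack]
"""

print(map_files_names(SAMPLE_MAPS, "ld-"))
```

Each returned name corresponds to a symlink under /proc/pid/map_files/. Following those links is what requires SYS_ADMIN, so inside a default container this only gets as far as computing the name.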
Side note 2:
While experimenting with the above we
found a deadlock in the kernel:
when a regular process tries to execve
“/proc/self/map_files/any-existing-entry”, it will deadlock (and
then opening “/proc/that-process-pid/maps” from any other process
will also hang, probably because some lock is still held).
Successful approach
The final, successful attempt involved
an approach very similar to the aforementioned idea with
/proc/self/map_files: we execute /proc/self/exe, which is the host's
docker-runc binary, while still being able to inject some code (we
did that by modifying a shared library, such as libc.so, so that it also
executes our code, e.g. inside __libc_start_main or a global constructor).
This gives us the ability to overwrite the /proc/self/exe binary, which is the
docker-runc binary from the host, which in turn gives us root access with
full capabilities on the host the next time docker-runc is executed.
Detailed attack description:
Craft a rogue image or compromise a
running container:
- Make the entrypoint binary (or any binary that is likely to be overridden at runtime by the user as the entrypoint, or run as part of docker exec) a symlink to /proc/self/exe.
- Replace any dynamic library used by docker-runc with a custom .so that has an additional global constructor. This function opens /proc/self/exe (which points to the host docker-runc) for reading (it is impossible to open it for writing, since the binary is being executed right now, see ETXTBSY in open(2)). Then this function executes another binary which opens, this time for writing, /proc/self/fd/3 (a file descriptor of docker-runc opened before execve), which succeeds because docker-runc is no longer being executed. The code can then overwrite the host docker-runc with anything. We have chosen a fake docker-runc with an additional global constructor that runs arbitrary code.
Thus, when a host user runs the
compromised image or “docker exec” on a compromised container:
- The entrypoint/exec binary that has been symlinked to /proc/self/exe (which in turn points to docker-runc on the host filesystem) begins executing within the container (this will also cause the process to be dumpable, as execve sets the dumpable flag). To be clear: this causes the original docker-runc process to re-execute into a new docker-runc running within the container (but using the host binary).
- When docker-runc begins executing for the second time, it will load .so files from the container, not the host (because this is the visible filesystem now). As a reminder: we control the content of these dynamic libraries.
- The malicious global constructor function will be executed. It will open /proc/self/exe for reading (let’s say it gets file descriptor 3) and execve() some attacker-controlled binary (let’s say /evil).
- /evil will overwrite docker-runc on the host filesystem (by reopening fd 3, this time with write access) with a backdoored/malicious docker-runc (e.g. with an additional global constructor).
- Now when any container is started or another exec is done, the attacker’s fake docker-runc will be executed as root with full capabilities on the host filesystem (this binary is responsible for dropping privileges and entering namespaces, so initially it has full permissions).
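The step that makes the overwrite possible is that a descriptor opened read-only can be reopened with different access through /proc/self/fd/, even when the original path is not reachable. The sketch below shows only this general Linux behaviour on a harmless temporary file that we own. The function names are ours and nothing here touches runc.

```python
import os
import tempfile


def reopen_for_write(read_fd, payload):
    """Reopen an already open descriptor for writing via /proc/self/fd."""
    # Opening the procfs link creates a new open file description, so the
    # access mode is chosen again and normal permission checks apply.
    write_fd = os.open(f"/proc/self/fd/{read_fd}", os.O_WRONLY | os.O_TRUNC)
    try:
        os.write(write_fd, payload)
    finally:
        os.close(write_fd)


def demo():
    """Overwrite a temporary file using only a read-only descriptor to it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"original")
        path = tmp.name
    read_fd = os.open(path, os.O_RDONLY)
    try:
        os.unlink(path)  # the path is gone, only the descriptor remains
        reopen_for_write(read_fd, b"replaced")
        return os.pread(read_fd, 64, 0)
    finally:
        os.close(read_fd)


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    print(demo())
```

In the attack the same reopening is done by /evil against the inherited fd 3, and it works there because the process holding the descriptor runs as root and the host binary is no longer being executed.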
Note that this attack only abuses runc
(opencontainers) behavior, so it should work for Kubernetes as well,
regardless of whether it uses Docker or CRI-O (both may use runc
internally).
This attack has a serious impact on AWS
and GCP
cloud services. More information about it can be found in the linked
security bulletins.
Responsible disclosure
We reported the vulnerability to
security@docker.com
the same day we discovered it, including a detailed attack
description and a proof of concept exploit. The next day the Docker
security team forwarded our email to security@opencontainers.org.
We also actively participated in discussions regarding fixing the
vulnerability. Communicating with the Docker and OpenContainers
security teams was frictionless and pleasant.
Rejected fix ideas in runc
Open the destination binary and compare inode info from fstat(2) with /proc/self/exe; exit if they match, otherwise call execveat on the destination binary fd.
This would detect if the destination binary
is a symlink to /proc/self/exe. Why execveat? Because we want to
avoid the race condition where, between the comparison and the exec, some other
process replaces the destination binary with a link to /proc/self/exe.
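A minimal sketch of the comparison this idea relies on is shown below. The helper names are ours, and this is not code from runc. Python's standard library has no direct execveat wrapper, so the exec step is only mentioned in a comment.

```python
import os


def same_inode(fd_a, fd_b):
    """Two descriptors refer to the same file if device and inode match."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def target_is_self(target_path, self_exe="/proc/self/exe"):
    """Check whether target_path resolves to the same file as self_exe."""
    # os.open follows symlinks, so a link to /proc/self/exe ends up here
    # as a descriptor to the running binary itself.
    target_fd = os.open(target_path, os.O_RDONLY)
    try:
        self_fd = os.open(self_exe, os.O_RDONLY)
        try:
            return same_inode(target_fd, self_fd)
        finally:
            os.close(self_fd)
    finally:
        # A real implementation would keep target_fd open and exec through
        # it, so that the checked file and the executed file cannot differ.
        os.close(target_fd)
```

The self_exe parameter exists only so the check can be tried on ordinary files: a symlink pointing at a file compares equal to that file, while an unrelated file does not.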
Why wouldn’t this work?
This can be bypassed when the attacker uses not a symlink, but a binary whose interpreter or dynamic loader points to “/proc/self/exe”: e.g. a text file which has “#!/proc/self/exe” as its first line, or just an ELF file.
Use a static binary to launch processes within the container
The idea of this is to avoid code
execution possibility via malicious .so files inside the container (a
static binary means no .so files are loaded).
Why wouldn’t this work?
Replacing .so files was not actually needed for this exploit. After the re-exec of /proc/self/exe (docker-runc), another process can just open /proc/<pid-of-docker-runc>/exe, which is possible because the “dumpable” flag is set on execve. This is a little harder to exploit because it requires winning a race between the re-exec completing and the runc process exiting (due to no parameters being given). In practice, the race window is so large that we were able to develop a 100% successful exploit for such a scenario. However, this fix would eliminate one of the attack vectors: running a rogue image.
Final fix applied in runc
In the end, the following fix was
applied to mitigate the vulnerability:
- Create a memfd (a special file which exists only in memory).
- Copy the original runc binary to this fd.
- Before entering namespaces re-exec runc from this fd.
This fix guarantees that if the
attacker overwrites the binary pointed to by /proc/self/exe then it
will not cause any damage to the host because it’s a copy of the
host binary, stored entirely in memory (tmpfs).
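The three steps can be sketched roughly as follows. This is an illustration in Python, not runc's actual code (which is not written in Python): the function names are ours, and sealing the memfd against further writes is an extra hardening step we assume here; it is not described in the text above. It needs Linux and Python 3.8 or newer.

```python
import fcntl
import os
import shutil


def clone_to_sealed_memfd(path):
    """Copy the file at path into an in-memory file and make it immutable."""
    fd = os.memfd_create("binary-clone", os.MFD_CLOEXEC | os.MFD_ALLOW_SEALING)
    try:
        with open(path, "rb") as src, open(fd, "wb", closefd=False) as dst:
            shutil.copyfileobj(src, dst)
        # Assumption: forbid any later change to the copy, including
        # changes to the set of seals itself.
        seals = (fcntl.F_SEAL_SEAL | fcntl.F_SEAL_SHRINK
                 | fcntl.F_SEAL_GROW | fcntl.F_SEAL_WRITE)
        fcntl.fcntl(fd, fcntl.F_ADD_SEALS, seals)
    except BaseException:
        os.close(fd)
        raise
    return fd


def reexec_from_memfd(fd, argv, env):
    """Replace the current process image with the in-memory copy."""
    if os.execve not in os.supports_fd:
        raise OSError("exec by file descriptor is not supported here")
    os.execve(fd, argv, env)
```

A caller would run something like reexec_from_memfd(clone_to_sealed_memfd("/proc/self/exe"), sys.argv, os.environ) before entering the container's namespaces. After that, /proc/self/exe refers to the in-memory copy, so a process in the container can at worst reach that copy and not the file on the host disk.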
Mitigations
There are several mitigation
possibilities when using an unpatched runc:
- Use Docker containers with SELinux enabled (--selinux-enabled). This prevents processes inside the container from overwriting the host docker-runc binary.
- Use a read-only file system on the host, at least for storing the docker-runc binary.
- Use a low privileged user inside the container or a new user namespace with uid 0 mapped to that user (then that user should not have write access to the runc binary on the host).
Timeline
- 1 January 2019 - Vulnerability discovered and PoC created
- 1 January - Vulnerability reported to security@docker.com
- 2 January - Report forwarded by the Docker security team to security@opencontainers.org
- 3 - 5 January - Discussion about fix ideas
- 11 February - End of CVE-2019-5736 embargo
- 13 February - Publication of this post
Authors: Adam Iwaniuk, Borys Popławski