Deep Dive Kubernetes Notes: Container Isolation

Posted by Henry Du on Saturday, November 13, 2021

Deep Dive Kubernetes Notes: Container Process Isolation

Process

A binary executable is stored in the file system as a file. When the operating system runs an executable, it loads the file into memory. On Linux, for example, executables use the ELF format. In the memory layout, the text segment contains the program's instructions, which the CPU executes as an execution path. During execution, the program may also open and close files and I/O devices. The program thus changes form, from a file on disk into a running environment made up of an execution path, data in memory, open files, and I/O devices. Taken together, we call this a process, and the operating system identifies it by a single PID. We can use the ps command to list all running processes, and we can look in /proc/PID to view the resources that belong to one of them.

Container is a running process

A container is a running process that the operating system creates with flags that place it in new namespaces. Like fork() and vfork(), the Linux-specific clone() system call creates a new process. The main use of clone() is in the implementation of threading libraries. When we pass the CLONE_NEWPID flag to clone(), the new process is created in a new PID namespace: it sees itself as PID 1 and cannot see processes outside that namespace. The following C call passes the CLONE_NEWPID flag.

/* The second argument is a pointer to the top of the child's stack, not a size. */
int pid = clone(main_function, stack + stack_size, CLONE_NEWPID | SIGCHLD, NULL);

Let's run busybox with the familiar docker run command. Here the container runs /bin/sh. We also pass -it to keep stdin open and allocate a TTY for interactive input and output, so that we can type ps -ef inside this isolated environment. As we can see, the /bin/sh process has PID 1.

> docker run -it busybox /bin/sh
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    8 root      0:00 ps -ef
/ #

Besides the PID namespace, Linux provides Mount, UTS, IPC, Network, and User namespaces. Each one isolates a different kind of resource for the processes cloned into it.

Some of the clone() flag bit-mask values

Flag                   Effect if present
CLONE_CHILD_CLEARTID   Clear ctid when child calls exec() or _exit() (2.6 onward)
CLONE_CHILD_SETTID     Write thread ID of child into ctid (2.6 onward)
CLONE_FILES            Parent and child share table of open file descriptors
CLONE_FS               Parent and child share attributes related to file system
CLONE_IO               Child shares parent's I/O context (2.6.25 onward)
CLONE_NEWIPC           Child gets new System V IPC namespace (2.6.19 onward)
CLONE_NEWNET           Child gets new network namespace (2.6.24 onward)
CLONE_NEWNS            Child gets copy of parent's mount namespace (2.4.19 onward)
CLONE_NEWPID           Child gets new process-ID namespace (2.6.24 onward)
CLONE_NEWUSER          Child gets new user-ID namespace (2.6.23 onward)
CLONE_NEWUTS           Child gets new UTS (uname()) namespace (2.6.19 onward)

Container is not a Hypervisor

When we use the docker run -it or kubectl exec -it command, we are actually entering a different view of the process environment, not a guest OS running on a hypervisor.

If we want to run a guest OS on top of a host OS, we need a hypervisor, such as VirtualBox, to virtualize the system hardware, including CPU, memory, and I/O devices, so that the guest OS behaves as if it were running on real computer hardware.

For a container started by the Docker engine, however, no separate "Docker container" entity runs on the host OS. Instead, Docker just creates an ordinary process with the new-namespace flags. That is why, after we enter the container environment, we see the process running as PID 1. We can also see that the container has its own directories, its own network devices, and its own network stack, including a routing table.

Reference

Deep Dive Kubernetes, by Lei Zhang, a TOC member of CNCF.