Deep Dive Kubernetes Notes: Container Process Isolation
A binary executable is stored in the file system as a file. When the operating system runs an executable, it loads the file into memory. On Linux, for example, executables use the ELF format. In the memory layout, the text segment contains the program's instructions, which the CPU executes along an execution path. Along the way, the program may open and close files and I/O devices. The program thus changes form, from a file on disk to a collection of an execution path, data in memory, open files, and I/O devices. Taken together, we call this a process, and the kernel identifies it by a single process ID (PID). We can use the ps command to list all running processes, and we can look under /proc/PID to view the resources that belong to a given process.
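To make the link between a PID and its /proc entry concrete, here is a minimal C sketch, assuming a Linux system with /proc mounted. The function names pid_from_proc and proc_pid_matches_getpid are hypothetical helpers introduced only for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the "Pid:" field from /proc/self/status (Linux only).
 * Returns the PID recorded there, or -1 if the file cannot be read. */
long pid_from_proc(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    long pid = -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* The line we want looks like "Pid:\t1234". */
        if (strncmp(line, "Pid:", 4) == 0) {
            sscanf(line + 4, "%ld", &pid);
            break;
        }
    }
    fclose(fp);
    return pid;
}

/* Returns 1 if /proc agrees with getpid(), 0 if it disagrees,
 * and -1 if /proc could not be read. */
int proc_pid_matches_getpid(void)
{
    long pid = pid_from_proc();
    if (pid == -1)
        return -1;
    return pid == (long)getpid() ? 1 : 0;
}
```

Calling proc_pid_matches_getpid() from a small main() should report agreement on an ordinary Linux host, since /proc/self is a link to the calling process's own /proc/PID directory.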
Container is a running process
A container is a running process that the operating system clones with flags requesting new namespaces. Like fork() and vfork(), the Linux-specific clone() system call creates a new process. One of the main uses of clone() is in the implementation of threading libraries. When we specify the CLONE_NEWPID flag in the call to clone(), the newly created process starts in a new PID namespace and sees only the processes inside that namespace. The following C code passes the CLONE_NEWPID flag:
/* stack_top points to the top of the memory reserved for the child's stack */
int pid = clone(main_function, stack_top, CLONE_NEWPID | SIGCHLD, NULL);
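The one-line call can be expanded into a fuller sketch. The version below is an illustration under assumptions, not how Docker itself is implemented: the names child_main, pid_seen_by_child, and STACK_SIZE are hypothetical, and the call needs CAP_SYS_ADMIN (typically root), so the function returns -1 when clone() is refused.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* required for clone() and CLONE_NEWPID in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)   /* illustrative size for the child's stack */

/* Entry point of the child: send the PID it sees for itself to the parent. */
static int child_main(void *arg)
{
    int write_fd = *(int *)arg;
    pid_t seen = getpid();
    if (write(write_fd, &seen, sizeof seen) != (ssize_t)sizeof seen)
        return 1;
    return 0;
}

/* Clone a child into a new PID namespace and return the PID the child
 * reports for itself. Returns -1 if setup or clone() fails, for example
 * with EPERM when the caller lacks CAP_SYS_ADMIN. */
long pid_seen_by_child(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int write_fd = fds[1];
    /* The stack grows downward on most architectures, so pass its top. */
    pid_t child = clone(child_main, stack + STACK_SIZE,
                        CLONE_NEWPID | SIGCHLD, &write_fd);

    long result = -1;
    close(fds[1]);                 /* the parent only reads */
    if (child != -1) {
        pid_t seen;
        if (read(fds[0], &seen, sizeof seen) == (ssize_t)sizeof seen)
            result = (long)seen;
        waitpid(child, NULL, 0);   /* reap the child */
    }
    close(fds[0]);
    free(stack);
    return result;
}
```

When the call is permitted, the child is the first process in its new PID namespace, so the value it reports is 1 even though the parent sees it under an ordinary PID on the host.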
Let’s run busybox with the docker run command, using /bin/sh as the command to start. We also pass -it so that Docker keeps standard input open and allocates a TTY for interactive input and output, which lets us type ps -ef inside this isolated environment. As we can see, the /bin/sh process has PID 1.
> docker run -it busybox /bin/sh
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    8 root      0:00 ps -ef
/ #
Besides the PID namespace, Linux provides Mount, UTS, IPC, Network, and User namespaces. Each one isolates a different kind of resource for the processes cloned into it.
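One way to see which namespaces a process belongs to is to read the symbolic links under /proc/self/ns. The sketch below assumes a Linux kernel that exposes these links; namespace_id is a hypothetical helper name used only for illustration.

```c
#include <stdio.h>
#include <unistd.h>

/* Copy the namespace identifier of the calling process into buf.
 * `name` is one of "mnt", "uts", "ipc", "net", "pid", "user".
 * The result has the form "<name>:[<inode number>]".
 * Returns 0 on success, -1 on failure. */
int namespace_id(const char *name, char *buf, size_t size)
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/self/ns/%s", name);
    if (n < 0 || (size_t)n >= sizeof path || size == 0)
        return -1;

    ssize_t len = readlink(path, buf, size - 1);
    if (len == -1)
        return -1;
    buf[len] = '\0';   /* readlink() does not NUL-terminate the result */
    return 0;
}
```

Two processes that print the same identifier for, say, "net" share a network namespace. A process inside a container would normally print different identifiers from a process on the host.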
A subset of the clone() flags bit-mask values:
|Flag||Effect if present|
|CLONE_CHILD_CLEARTID||Clear ctid when child calls exec() or _exit() (2.6 onward)|
|CLONE_CHILD_SETTID||Write thread ID of child into ctid (2.6 onward)|
|CLONE_FILES||Parent and child share table of open file descriptors|
|CLONE_FS||Parent and child share attributes related to file system|
|CLONE_IO||Child shares parent’s I/O context (2.6.25 onward)|
|CLONE_NEWIPC||Child gets new System V IPC namespace (2.6.19 onward)|
|CLONE_NEWNET||Child gets new network namespace (2.6.24 onward)|
|CLONE_NEWNS||Child gets copy of parent’s mount namespace (2.4.19 onward)|
|CLONE_NEWPID||Child gets new process-ID namespace (2.6.19 onward)|
|CLONE_NEWUSER||Child gets new user-ID namespace (2.6.23 onward)|
|CLONE_NEWUTS||Child gets new UTS (utsname()) namespace (2.6.19 onward)|
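Because each flag in the table is a single bit, flags are combined with bitwise OR and tested with bitwise AND, and the low byte of the mask carries the signal sent to the parent when the child exits. The sketch below shows this with a flag selection chosen purely for illustration; container_like_flags, has_flag, exit_signal, and EXIT_SIGNAL_MASK are hypothetical names, not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* required for the CLONE_* constants in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>

/* The low byte of the clone() flags holds the child's termination signal. */
#define EXIT_SIGNAL_MASK 0xffUL

/* An illustrative combination of namespace flags plus the exit signal. */
unsigned long container_like_flags(void)
{
    return (unsigned long)(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
                           CLONE_NEWIPC | CLONE_NEWNET | SIGCHLD);
}

/* Returns 1 if `flag` is set in `flags`, 0 otherwise. */
int has_flag(unsigned long flags, unsigned long flag)
{
    return (flags & flag) != 0;
}

/* Extracts the termination signal from the low byte of `flags`. */
int exit_signal(unsigned long flags)
{
    return (int)(flags & EXIT_SIGNAL_MASK);
}
```

With this selection, has_flag(container_like_flags(), CLONE_NEWPID) is true, has_flag(container_like_flags(), CLONE_NEWUSER) is false, and exit_signal() recovers SIGCHLD.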
Container is not a Hypervisor
When we use the docker run -it command or the kubectl exec -it command, we are actually entering a different view of the process environment, not a guest OS running on a hypervisor.
If we want to run a guest OS on top of a host OS, we need a hypervisor, such as VirtualBox, to virtualize the system hardware, including CPU, memory, and I/O devices, so that the guest OS believes it is running on real computer hardware.
A container running on top of the Docker engine is different: there is no separate "Docker container" entity running on the host OS the way a guest OS runs on a hypervisor. Instead, Docker simply helps clone a process with the new-namespace flags. Therefore, after we enter the container environment, we see our process running as PID 1. We can also see that the container has its own directory tree, its own network devices, and its own network stack, including its routing table.
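One consequence of this design can be checked directly: a namespaced process still talks to the host's kernel. The sketch below, assuming a conventional container runtime that does not insert its own kernel or virtual machine, formats the kernel name and release reported by uname(). The helper name kernel_identity is hypothetical.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Write "<sysname> <release>" of the running kernel into buf.
 * Returns 0 on success, -1 if uname() fails or buf is too small. */
int kernel_identity(char *buf, size_t size)
{
    struct utsname u;
    if (uname(&u) == -1)
        return -1;

    int n = snprintf(buf, size, "%s %s", u.sysname, u.release);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

Under that assumption, printing the result on the host and again inside a container on the same machine should give the same kernel release, whereas a guest OS under a hypervisor reports the kernel it booted itself.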
Deep Dive Kubernetes: Lei Zhang, a TOC member of CNCF.