Container Networks

Posted by Henry Du on Tuesday, November 23, 2021


Network Namespace

A Linux container’s network stack is isolated in its own network namespace. This network stack includes the network interfaces, the loopback device, the routing table, and the iptables chains and rules. A containerized process sends and receives network traffic through its own stack. If a container should use the host’s network stack instead, we pass the --net=host option.

Let’s start an Nginx container that shares the host network.

docker run -d --net=host --name nginx-host nginx

After Nginx starts, it listens on TCP port 80 in the host network; with --net=host, the -p port publishing option is unnecessary, because the container binds host ports directly. Since the container shares the host’s network stack, it can contend with other processes for the same port, and the bind fails if the port is already taken. In most cases, therefore, we give containers their own network namespaces.

root@docker-desktop:/# netstat -tpln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/nginx: master pro
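
We can double-check that the container really shares the host’s network namespace by comparing namespace identifiers in procfs (a quick sanity check; PID 1 stands in for the host namespace here):

# PID of the containerized nginx master process, as seen from the host.
PID=$(docker inspect -f '{{.State.Pid}}' nginx-host)

# With --net=host, both links point to the same net namespace inode.
readlink /proc/1/ns/net
readlink /proc/$PID/ns/net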

This raises a question: how do containers on one host communicate with each other? Every container has its own network namespace, so for them to exchange traffic, something like a virtual network cable must connect them. This is where the veth pair device comes in.

Let’s restart the Nginx container with its own network namespace, and then continue to the next section.

docker rm -f nginx-host
docker run -d -p 80:80 --name nginx-host nginx

A veth pair device, when created, consists of two network interfaces, a.k.a. veth peers. We can think of it as a point-to-point network link: packets sent on one interface are received by the other, even when one end sits in one network namespace and the other end in another.
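
As an aside, we can build the same construct by hand with the ip tool. The interface and namespace names below (veth-a, veth-b, demo-ns) and the 10.0.0.0/24 addresses are arbitrary choices for illustration:

# Create a namespace and a veth pair, then move one end into the namespace.
ip netns add demo-ns
ip link add veth-a type veth peer name veth-b
ip link set veth-b netns demo-ns

# Assign addresses and bring both ends up.
ip addr add 10.0.0.1/24 dev veth-a
ip link set veth-a up
ip netns exec demo-ns ip addr add 10.0.0.2/24 dev veth-b
ip netns exec demo-ns ip link set veth-b up

# Packets sent on one end arrive at the other, across namespaces.
ping -c 1 10.0.0.2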

If we enter the Nginx container, we won’t see any of the host’s network interfaces. Instead, we see the interface eth0, which is exactly one end of the veth pair.
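
We can enter the container with docker exec (assuming a shell is available in the image):

docker exec -it nginx-host bash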

root@f83827aca31f:/# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 0.0.0.0
        inet6 fe80::42:acff:fe11:2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:ac:11:00:02  txqueuelen 0  (Ethernet)
        RX packets 2669  bytes 8909570 (8.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2454  bytes 200664 (195.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

We can also inspect the container’s routing table.

root@d32ad39caa3b:/# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0

The network interface eth0 carries the default route. Per the second entry, all traffic destined for the directly connected network 172.17.0.0/16 is also sent out through eth0.

The other end of the veth pair is located in the host network namespace.

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 0.0.0.0
        inet6 fe80::42:bcff:fe5e:d16d  prefixlen 64  scopeid 0x20<link>
        ether 02:42:bc:5e:d1:6d  txqueuelen 0  (Ethernet)
        RX packets 8  bytes 544 (544.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 24  bytes 3316 (3.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth83f2c5c: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::6015:54ff:fe9c:39b9  prefixlen 64  scopeid 0x20<link>
        ether 62:15:54:9c:39:b9  txqueuelen 0  (Ethernet)
        RX packets 8  bytes 656 (656.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 40  bytes 5384 (5.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Running the brctl show command, we see that the other end of the veth pair, veth83f2c5c, is “attached” to the bridge device docker0.

# brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242bc5ed16d       no              veth83f2c5c
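
As a side note, with several containers running it is not obvious which host-side veth belongs to which container. One way to match them (assuming /sys is readable inside the container) is via the peer interface index:

# Inside the container: the ifindex of eth0's peer on the host.
IDX=$(docker exec nginx-host cat /sys/class/net/eth0/iflink)

# On the host: print the interface whose index matches.
ip -o link | awk -v idx="$IDX" -F': ' '$1 == idx {print $2}'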

So far, we have connected a single container’s network to the host network using a veth pair. In the next section, we will see how multiple containers on one host communicate with each other.

Communication Between Multiple Containers on One Host

Let’s start another Nginx container.

docker run -d --name nginx-host-2 nginx

We can see that another veth peer interface, veth1b1d652, has been attached to docker0.

# brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242bc5ed16d       no              veth1b1d652
                                                        veth83f2c5c

If we enter the container nginx-host and ping the nginx-host-2 container’s IP address, 172.17.0.3, we observe that the two containers are connected.
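
For example (172.17.0.3 is simply whatever address Docker assigned on the default bridge; ping may first need to be installed in the image, e.g. the iputils-ping package):

# Look up nginx-host-2's IP address on the default bridge network.
docker inspect -f '{{.NetworkSettings.IPAddress}}' nginx-host-2

# Ping it from inside nginx-host.
docker exec -it nginx-host ping -c 3 172.17.0.3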

When we issue the ping command to destination 172.17.0.3, the local routing table is consulted using the longest-prefix-match rule, and the second entry matches.

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0

The gateway is 0.0.0.0, which means the destination network is directly connected: packets matching this rule go straight out of the interface eth0 without passing through a gateway.
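
We can also ask the kernel directly which route it would pick for this destination (assuming the iproute2 tools are available inside the container):

# Resolve the route for 172.17.0.3 from inside nginx-host.
docker exec nginx-host ip route get 172.17.0.3
# Typical output: 172.17.0.3 dev eth0 src 172.17.0.2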

To assemble the Ethernet frame carrying the ICMP (ping) packet, the destination MAC address is needed. An ARP (Address Resolution Protocol) broadcast is therefore sent via eth0, essentially asking: whoever owns the IP address 172.17.0.3, please tell me your MAC address.

The container’s eth0 is paired with veth83f2c5c, which is attached to the bridge device docker0. All interfaces attached to a bridge belong to one broadcast (CSMA/CD) domain, so docker0 floods the ARP broadcast to every other attached interface, in our case veth1b1d652. When container nginx-host-2 receives the ARP request, it replies with the MAC address for 172.17.0.3. With that answer, the ICMP echo request can be assembled and sent to its destination.
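
We can watch both the ARP exchange and the ICMP echo from the host side of the bridge (assuming tcpdump is installed on the host):

# Capture ARP and ICMP traffic crossing the docker0 bridge while pinging.
tcpdump -ni docker0 arp or icmp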

The following diagram illustrates the ping network traffic flow.

iptables

When network traffic travels from one network namespace to another, the Linux kernel’s Netfilter framework plays an important role. We can enable the iptables TRACE target to watch how packets flow through the chains.

iptables -t raw -A OUTPUT -p icmp -j TRACE
iptables -t raw -A PREROUTING -p icmp -j TRACE
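
Tracing every ICMP packet is noisy, so once we are done we should delete the same rules again:

# Remove the TRACE rules added above.
iptables -t raw -D OUTPUT -p icmp -j TRACE
iptables -t raw -D PREROUTING -p icmp -j TRACE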

After enabling TRACE and running the ping again, we can watch the trace output in dmesg. (Depending on the kernel, the logging backend may need to be loaded first, e.g. with modprobe nf_log_ipv4.)

[18414070.307075] TRACE: raw:PREROUTING:policy:5 IN=docker0 OUT= PHYSIN=veth83f2c5c MAC=02:42:ac:11:00:03:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
[18414070.307083] TRACE: mangle:PREROUTING:rule:1 IN=docker0 OUT= PHYSIN=veth83f2c5c MAC=02:42:ac:11:00:03:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
[18414070.307089] TRACE: mangle:PREROUTING_direct:return:1 IN=docker0 OUT= PHYSIN=veth83f2c5c MAC=02:42:ac:11:00:03:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
[18414070.307094] TRACE: mangle:PREROUTING:rule:2 IN=docker0 OUT= PHYSIN=veth83f2c5c MAC=02:42:ac:11:00:03:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
...
[18414070.307307] TRACE: mangle:FORWARD:rule:1 IN=docker0 OUT=docker0 PHYSIN=veth83f2c5c PHYSOUT=veth1b1d652 MAC=02:42:ac:11:00:03:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
[18414070.307312] TRACE: mangle:FORWARD_direct:return:1 IN=docker0 OUT=docker0 PHYSIN=veth83f2c5c PHYSOUT=veth1b1d652 MAC=02:42:ac:11:00:03:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
...
[18414070.307365] TRACE: mangle:POSTROUTING:rule:2 IN= OUT=docker0 PHYSIN=veth83f2c5c PHYSOUT=veth1b1d652 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
[18414070.307369] TRACE: mangle:POSTROUTING_direct:return:1 IN= OUT=docker0 PHYSIN=veth83f2c5c PHYSOUT=veth1b1d652 SRC=172.17.0.2 DST=172.17.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32106 DF PROTO=ICMP TYPE=8 CODE=0 ID=31596 SEQ=1
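
The same Netfilter machinery also implements the -p 80:80 port mapping we used earlier: Docker installs a DNAT rule in the nat table’s DOCKER chain, plus a MASQUERADE rule in POSTROUTING for outbound container traffic. We can inspect both on the host:

# The DNAT entry maps host port 80 to the container's 172.17.0.2:80.
iptables -t nat -L DOCKER -n

# Outbound container traffic is source-NATed by the MASQUERADE rule.
iptables -t nat -L POSTROUTING -n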

Therefore, a containerized process isolated in its own network namespace uses a veth pair and the host’s network bridge to communicate with other containers, and even with the world outside the host.
