Cross-host Container Networks

VXLAN Mode

Posted by Henry Du on Wednesday, December 1, 2021

VXLAN

We have already reviewed the UDP mode of cross-host container networks and saw that it has a performance problem: every IP packet has to be processed by a user-space application. VXLAN mode lets the encapsulation and decapsulation happen in Linux kernel space instead.

Virtual Extensible LAN, a.k.a. VXLAN (RFC 7348), is supported natively by the Linux kernel. It wraps the original packets in an additional layer of headers and builds an overlay network, much as the UDP tunnel does.

The idea of VXLAN is to build a virtual link layer (layer 2) network on top of the existing host network: Ethernet frames are carried inside layer 4 UDP datagrams, with 4789 as the IANA-assigned UDP port, so that all containers on the hosts behave as if they share one LAN. The hosts need not be in the same data center, or even in the same location.

VXLAN Mode

To build this link layer tunnel, VXLAN needs a special device on each host, called a VTEP (VXLAN Tunnel End Point) device.

Like the flanneld process in UDP mode, the VTEP device encapsulates and decapsulates traffic, but it works on Ethernet frames and all the work is done in kernel space.

As the diagram shows, each node has a VTEP device named flannel.1, which has both an IP address and a MAC address. When container-1 wants to talk to container-2, the destination IP address 100.96.2.3 goes into the inner IP header. The packet arrives at the docker0 device on Node 1 and, following the routing table, reaches the flannel.1 device.

To find the VTEP device on Node 2, the routing table must contain the relevant information. The routing entries are maintained by the flanneld process. For example, when Node 2 joins the Flannel network, every other node, including Node 1, gets one more routing entry.

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
100.96.2.0      100.96.2.0      255.255.255.0   UG    0      0        0 flannel.1

This means that all traffic destined for 100.96.2.0/24 leaves through the flannel.1 interface, with 100.96.2.0 as the gateway. That gateway address is the IP address of the flannel.1 device on Node 2, so the traffic is directed toward Node 2.
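The route lookup described here can be sketched in a few lines of Python. This is only an illustration, not how the kernel implements routing: the `ROUTES` list and the `lookup_route` helper are hypothetical names, and the single entry is copied from the `route -n` output above.

```python
import ipaddress

# Hypothetical in-memory copy of the routing entry shown by `route -n`.
ROUTES = [
    {
        "network": ipaddress.ip_network("100.96.2.0/24"),
        "gateway": ipaddress.ip_address("100.96.2.0"),
        "iface": "flannel.1",
    },
]


def lookup_route(dst):
    """Return the most specific route whose network contains dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in ROUTES if addr in r["network"]]
    if not matches:
        return None
    # Longest-prefix match: prefer the route with the longest netmask.
    return max(matches, key=lambda r: r["network"].prefixlen)


# The packet for container-2 matches the entry for Node 2's subnet.
route = lookup_route("100.96.2.3")
print(route["iface"], route["gateway"])
```

A destination outside 100.96.2.0/24 returns `None` in this sketch; on a real host it would fall through to other routing entries that are not shown in the article.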

VTEP devices work at the link layer, so an Ethernet header has to be added in front of the inner IP header, and for that we need the MAC address of flannel.1 on Node 2. Because both flannel.1 devices have IP addresses, this is an ARP lookup from IP address to MAC address. No ARP request has to cross the network, though: flanneld adds the ARP record on Node 1 automatically. We can view it with the ip command.

$ ip neigh show dev flannel.1
100.96.2.0 lladdr 34:36:3b:d2:06:94 PERMANENT

With the MAC address and IP address of the destination VTEP device, the Linux kernel can build the inner Ethernet frame.

However, the inner Ethernet frame means nothing to the host network, because the VTEP MAC address is not known there. The Linux kernel therefore has to wrap the inner Ethernet frame in an ordinary frame that the host network does recognize, so that it can be sent out through the host's eth0 interface. That is why there is also an outer Ethernet header.

The Linux kernel adds a VXLAN header in front of the inner Ethernet header to mark the VXLAN encapsulation. The VXLAN header contains a VNI (VXLAN Network Identifier) field, which VTEP devices use to tell different VXLAN tunnels apart. The default VNI in a Flannel network is 1, which is why the device is named flannel.1.
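To make the header concrete, here is a minimal Python sketch that packs the 8-byte VXLAN header layout defined in RFC 7348: 8 flag bits, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function name `build_vxlan_header` is made up for this illustration; in practice the kernel builds this header, not user code.

```python
import struct

# "I" flag from RFC 7348: set when the VNI field is valid.
VXLAN_FLAG_VNI_VALID = 0x08


def build_vxlan_header(vni):
    """Pack the 8-byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # First 32-bit word: flags in the top byte, reserved bits zero.
    # Second 32-bit word: VNI in the top 24 bits, reserved low byte zero.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


# Flannel's default VNI is 1.
header = build_vxlan_header(1)
print(header.hex())  # 0800000000000100
```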

The Linux kernel then puts the result into a UDP datagram and sends it to the destination on the VXLAN UDP port (4789 is the IANA-assigned port; Flannel on Linux uses the kernel default, 8472, unless configured otherwise). As in UDP mode, the flannel.1 device on Node 1 communicates with the flannel.1 device on Node 2 through UDP datagrams. The important difference is that the UDP payload is now the inner Ethernet frame with a VXLAN header.

But how does the kernel know which host address, the eth0 IP address of Node 2, to put in the outer IP header?

The flannel.1 device also plays the role of a network bridge: it forwards frames at the link layer by destination MAC address. For this the Linux kernel maintains a forwarding database, a.k.a. FDB, and flanneld maintains the FDB entries for the whole Flannel network. We can view the entry for Node 2 with the bridge fdb command on Node 1.

$ bridge fdb show dev flannel.1 | grep 34:36:3b:d2:06:94
34:36:3b:d2:06:94 dev flannel.1 dst 10.168.0.3 self permanent

So, to send a frame to the destination VTEP device with MAC address 34:36:3b:d2:06:94, the kernel sets the destination address in the outer IP header to 10.168.0.3, which is the address of the eth0 interface on Node 2.
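The two lookups, from gateway IP address to VTEP MAC address and from VTEP MAC address to remote host IP address, can be chained in a short Python sketch. It only parses the sample lines from the `ip neigh` and `bridge fdb` output above; the helper names are hypothetical, and the real lookups happen inside the kernel.

```python
# Sample lines copied from the command output shown earlier.
NEIGH_OUTPUT = "100.96.2.0 lladdr 34:36:3b:d2:06:94 PERMANENT"
FDB_OUTPUT = "34:36:3b:d2:06:94 dev flannel.1 dst 10.168.0.3 self permanent"


def parse_neigh(line):
    """Return (vtep_ip, vtep_mac) from one `ip neigh show` line."""
    fields = line.split()
    return fields[0], fields[fields.index("lladdr") + 1]


def parse_fdb(line):
    """Return (vtep_mac, remote_host_ip) from one `bridge fdb show` line."""
    fields = line.split()
    return fields[0], fields[fields.index("dst") + 1]


def resolve_outer_dst(gateway_ip):
    """Follow gateway IP -> VTEP MAC (ARP) -> remote host IP (FDB)."""
    arp_table = dict([parse_neigh(NEIGH_OUTPUT)])
    fdb_table = dict([parse_fdb(FDB_OUTPUT)])
    vtep_mac = arp_table[gateway_ip]
    return vtep_mac, fdb_table[vtep_mac]


# Gateway taken from the routing entry for 100.96.2.0/24.
print(resolve_outer_dst("100.96.2.0"))
# ('34:36:3b:d2:06:94', '10.168.0.3')
```

The first value becomes the destination MAC address of the inner Ethernet frame, and the second becomes the destination address of the outer IP header.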

Finally, the Linux kernel adds the regular link layer header (the outer Ethernet header) in front of the outer IP header. The destination MAC address is the MAC address of eth0 on Node 2, which is available from the local ARP table on Node 1.

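The whole packet is encapsulated as the diagram shows. From the outside in, it consists of the outer Ethernet header, outer IP header, UDP header, VXLAN header, inner Ethernet header, inner IP header, and the payload from container-1.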
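As a rough illustration of what this layering costs, the sketch below adds up the headers that VXLAN places in front of the inner Ethernet frame. The sizes are assumptions not stated in the article: standard Ethernet headers without VLAN tags and IPv4 headers without options. The names `OUTER_HEADERS`, `vxlan_overhead`, and `wire_size` are made up for this example.

```python
# Header sizes in bytes, outermost first.
# Assumes IPv4 without options and Ethernet without VLAN tags.
OUTER_HEADERS = [
    ("outer Ethernet header", 14),
    ("outer IP header", 20),
    ("UDP header", 8),
    ("VXLAN header", 8),
]
INNER_ETHERNET_HEADER = 14


def vxlan_overhead():
    """Bytes added in front of the inner Ethernet frame."""
    return sum(size for _, size in OUTER_HEADERS)


def wire_size(inner_ip_packet_len):
    """Total frame size on the host network for one inner IP packet."""
    return vxlan_overhead() + INNER_ETHERNET_HEADER + inner_ip_packet_len


print(vxlan_overhead())  # 50
print(wire_size(100))    # 164
```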

When the Linux kernel on Node 2 receives the packet, it detects the VXLAN header with VNI 1, strips the outer headers, and hands the inner Ethernet frame to the flannel.1 device. That device decapsulates further to reach the inner IP header, and finally the IP packet from container-1 arrives at container-2.
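The first decapsulation step on the receiving side can be sketched as follows. The sketch assumes the outer Ethernet, IP, and UDP headers have already been removed, so the input is the UDP payload; `parse_vxlan` is a hypothetical helper, and the inner frame is a placeholder byte string rather than a real Ethernet frame.

```python
import struct


def parse_vxlan(udp_payload):
    """Split a VXLAN UDP payload into (vni, inner_ethernet_frame)."""
    if len(udp_payload) < 8:
        raise ValueError("payload shorter than the 8-byte VXLAN header")
    word1, word2 = struct.unpack("!II", udp_payload[:8])
    # The top byte of the first word holds the flags; 0x08 marks a valid VNI.
    if not (word1 >> 24) & 0x08:
        raise ValueError("VNI-valid flag is not set")
    # The VNI occupies the top 24 bits of the second word.
    vni = word2 >> 8
    return vni, udp_payload[8:]


# Hypothetical payload: a VXLAN header with VNI 1, then placeholder bytes.
payload = bytes.fromhex("0800000000000100") + b"inner-ethernet-frame"
vni, inner = parse_vxlan(payload)
print(vni, inner)  # 1 b'inner-ethernet-frame'
```

A receiver would then compare the VNI with the one configured on its VTEP device (1 for flannel.1) before handing the inner frame over.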

Reference

Deep Dive Kubernetes: Lei Zhang, a TOC member of CNCF.