Cilium: K8S Service Load Balancing - Part 1

Posted by Henry Du on Monday, November 30, 2020

This post collects my reading notes on K8S Service Load Balancing with BPF & XDP, presented by Daniel Borkmann and Martynas Pumputis at the Linux Plumbers Conference.

Kubernetes Networking Basics

I summarized Kubernetes networking when I introduced the Flannel CNI. The Kubernetes network is flat in the sense that each pod must be reachable by its IP address within the cluster. A pod is an instance running on one node with its own network namespace and cgroup. K8S only defines the network model. The CNI plugin is responsible for creating the network, assigning the pod IP, and removing the network when the pod is gone.

Pod IP

As the diagram below shows, three pods are spread across two nodes, each with its own pod IP address. The client 192.168.0.1 can easily access a pod by its pod IP address.

However, pods managed by K8S may come and go. K8S cannot guarantee that a pod gets the same IP address each time it starts. In addition, there is no load balancing, because all requests from the client go to the one pod behind the given IP address.

Host Port

As the diagram below shows, we can assign a port in the network namespace that belongs to the node host, and then forward all traffic arriving on this host port to the backend pod.

However, HostPort to local pod is a one-to-one mapping; that is, only one pod can back a given HostPort on a node. In addition, iptables on the node host applies a DNAT rule before sending traffic to the backend pod. Again, there is no load balancing.
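The one-to-one constraint can be illustrated with a toy model. This is only a sketch: `HostPortTable`, host port 8080, and the pod address are hypothetical names and values chosen for illustration, not a Kubernetes or iptables API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    ip: str
    port: int


class HostPortTable:
    """Toy model of a node's HostPort rules (not a real Kubernetes API)."""

    def __init__(self):
        self._rules = {}

    def bind(self, host_port, pod):
        # One-to-one: a host port can be backed by at most one local pod.
        if host_port in self._rules:
            raise ValueError(f"host port {host_port} is already taken")
        self._rules[host_port] = pod

    def dnat(self, dst_port):
        # Rewrite the destination to the single pod behind this host port.
        return self._rules[dst_port]


table = HostPortTable()
table.bind(8080, Endpoint("10.1.0.1", 80))  # hypothetical host port and pod
```

Binding a second pod to port 8080 raises `ValueError`, which is the reason HostPort alone cannot spread traffic across several pods.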

NodePort Service

NodePort is an enhancement of HostPort. It maps a pod port to a port in the host network namespace. In addition, every node in the K8S cluster reserves the same port for all pods of the service. So it is possible to load balance traffic across different nodes by using the same port.

As the diagram below shows, the NodePort is 30001. When client traffic reaches port 30001, iptables applies DNAT and forwards it to a backend pod on port 80.

This is where the K8S Service abstraction comes from. Multiple pods can back the same service, and no matter how backend pods come and go, the service IP does not change. The client does not care which node a pod is located on. Load balancing takes place here in the sense that every node can accept the traffic.
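The DNAT step for a NodePort can be sketched as follows. The pod IPs are hypothetical, and choosing a backend uniformly at random is an assumption of this sketch rather than a statement about any particular implementation.

```python
import random

NODE_PORT = 30001  # NodePort from the example above

# Hypothetical pod IPs backing the service; every pod listens on port 80.
BACKENDS = [("10.1.0.1", 80), ("10.1.0.2", 80), ("10.2.0.1", 80)]


def nodeport_dnat(dst_port, backends, rng):
    """Return the backend a new connection is DNATed to.

    Returns None when the destination port is not the NodePort.
    Uniform random selection is an assumption of this sketch.
    """
    if dst_port != NODE_PORT:
        return None
    return rng.choice(backends)


rng = random.Random(0)
picks = [nodeport_dnat(30001, BACKENDS, rng) for _ in range(1000)]
```

Any node can run the same function with the same backend list, which is why the client does not need to know where a pod lives.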

The service is also reachable from the host network namespace on every node, without DNS, through any local address, e.g. 127.0.0.1:NodePort.

However, as the diagram below shows, SNAT-based implementations hide the client IP address and introduce an extra hop for replies if the backend is remote.
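To make the effect of SNAT concrete, here is a small sketch that rewrites a packet the way a node would when the backend is remote. The node IP, the remote pod IP, and both ephemeral ports are made-up values; only the client IP 192.168.0.1 and NodePort 30001 come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


CLIENT = "192.168.0.1:40000"  # client IP from the text; source port is made up
NODE_1 = "10.0.0.1"           # hypothetical node IP
REMOTE_POD = "10.2.0.1:80"    # hypothetical pod running on another node


def forward_via_node(pkt, node_ip, backend):
    # DNAT: the NodePort destination becomes the backend pod.
    pkt = replace(pkt, dst=backend)
    # SNAT: the source becomes the node, so the reply must come back here.
    return replace(pkt, src=f"{node_ip}:50000")  # 50000: made-up NAT port


request = Packet(src=CLIENT, dst=f"{NODE_1}:30001")
seen_by_pod = forward_via_node(request, NODE_1, REMOTE_POD)
reply = Packet(src=seen_by_pod.dst, dst=seen_by_pod.src)
```

The pod only ever sees the node address as the source, and the reply is addressed to the node rather than the client, which is the extra hop.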

Service with External IP

An External IP is what the firewall world knows as a Virtual IP (VIP). It exposes a virtual IP outside the K8S cluster and maps it to an internal service address, which is a pod IP. This is a typical DNAT case.

In the diagram above, 1.1.1.1 is an external IP. All traffic sent to 1.1.1.1 is DNATed on node 1 to the destination 10.1.0.1:80. An external IP can impersonate any public IP inside the cluster, as long as the network routes that IP to these nodes.

However, external IPs are not managed by K8S. They need to be announced, e.g. via BGP, so that traffic is routed to the nodes. Also, exposing an external IP is not good practice because of the potential for traffic spoofing.

LoadBalancer Service On-Premise

There is a simple boundary between cloud and on-premise. Services that are managed by a service provider, including installation, upgrades, monitoring and maintenance, are called cloud. In contrast, services that are managed by the customers themselves are called on-premise.

The load balancer implementation is provided by the cloud provider or, on-prem, by MetalLB. MetalLB can announce IPs via ARP/NDP or BGP. Load balancer IPs are managed through K8S, not by the CNI plugin but by the LoadBalancer implementation. MetalLB handles IP address allocation and external announcement, but does not sit in the critical fast path.
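The allocation half of that job can be pictured with a toy address pool. This is a sketch only: `AddressPool`, the service names, and the 192.0.2.0/30 documentation range are hypothetical, and MetalLB's real configuration and allocation logic are different and richer.

```python
import ipaddress


class AddressPool:
    """Toy load balancer IP allocator (not MetalLB's actual logic)."""

    def __init__(self, cidr):
        self._free = list(ipaddress.ip_network(cidr).hosts())
        self._assigned = {}

    def allocate(self, service):
        # Idempotent: a service keeps the address it already holds.
        if service not in self._assigned:
            if not self._free:
                raise RuntimeError("address pool exhausted")
            self._assigned[service] = self._free.pop(0)
        return str(self._assigned[service])


pool = AddressPool("192.0.2.0/30")  # documentation range with two usable hosts
```

Note that the allocator only hands out addresses; packets never pass through it, which matches the point that the allocator is not in the fast path.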

LoadBalancer Service Cloud

For a LoadBalancer service in the cloud:

  • All major cloud providers offer this for their managed K8S, including EKS, GKE, and AKS.
  • No additional user setup is needed with regard to BGP etc.
  • The cloud LB performs health checks, probing individual backend nodes to see whether they respond.
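The health-check step in the last bullet can be sketched as a simple filter. The node IPs are hypothetical and `probe` is a caller-supplied stand-in; a real cloud LB would typically run a TCP or HTTP check against each node on its own schedule.

```python
def healthy_nodes(nodes, probe):
    """Keep only the nodes whose probe succeeds.

    `probe` is a hypothetical callable that returns True when a node responds.
    """
    return [node for node in nodes if probe(node)]


# Hypothetical node IPs; pretend 10.0.0.2 has stopped responding.
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
responding = {"10.0.0.1", "10.0.0.3"}

targets = healthy_nodes(NODES, lambda node: node in responding)
```

Only the nodes left in `targets` would receive new traffic from the cloud LB.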

However, the downside is that there are now two layers of load balancing, and the cloud LB's programming time can be slow.

ClusterIP Service

A ClusterIP service is how pods reach each other inside the K8S cluster. A ClusterIP is also a VIP.

There is a dedicated IP range for ClusterIPs. These addresses are non-routable and always translated locally to a backend, so they are for in-cluster access only.
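A minimal sketch of that local translation follows. The 10.96.0.0/12 range, the service VIP, and the pod IPs are all assumptions for illustration, and picking a backend by `flow_hash` modulo the backend count is just one simple choice.

```python
import ipaddress

SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")  # assumed ClusterIP range

# Hypothetical service table: (ClusterIP, port) -> backend pods.
SERVICES = {
    ("10.96.0.10", 80): [("10.1.0.1", 80), ("10.1.0.2", 80)],
}


def translate(dst_ip, dst_port, flow_hash):
    """Translate a ClusterIP destination to a backend on the local node."""
    if ipaddress.ip_address(dst_ip) not in SERVICE_CIDR:
        return (dst_ip, dst_port)  # not a service address, leave untouched
    backends = SERVICES.get((dst_ip, dst_port))
    if not backends:
        return None  # unknown VIP: nothing outside the node can route it
    return backends[flow_hash % len(backends)]
```

For example, `translate("10.96.0.10", 80, 1)` yields the second backend, while a plain pod address such as 10.1.0.1 passes through unchanged.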

In fact, when we create a LoadBalancer service, K8S creates the following three types of service for us:

  • LoadBalancer
  • NodePort
  • ClusterIP

They are all associated with the same set of backend pods.
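One way to picture this is a single Service object that carries all three entry points. The sketch below uses a plain dict whose field names follow the Kubernetes Service API (`spec.type`, `spec.clusterIP`, `spec.ports[].nodePort`, `status.loadBalancer.ingress`); every value, and the `frontends` helper itself, is hypothetical.

```python
# Trimmed-down Service object as a plain dict; all values are hypothetical.
service = {
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "web"},
        "clusterIP": "10.96.0.10",
        "ports": [{"port": 80, "targetPort": 80, "nodePort": 30001}],
    },
    "status": {"loadBalancer": {"ingress": [{"ip": "192.0.2.1"}]}},
}


def frontends(svc):
    """List the three entry points that lead to the same selected pods."""
    port = svc["spec"]["ports"][0]
    lb_ip = svc["status"]["loadBalancer"]["ingress"][0]["ip"]
    return {
        "LoadBalancer": f"{lb_ip}:{port['port']}",
        "NodePort": f"<any node IP>:{port['nodePort']}",
        "ClusterIP": f"{svc['spec']['clusterIP']}:{port['port']}",
    }
```

All three addresses resolve to the pods matched by the one `selector`, which is what "the same set of backend pods" means here.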

Conclusion

This is the first part of a series on K8S service load balancing. There are also various K8S features for services, such as sessionAffinity and externalTrafficPolicy. We will introduce Cilium’s service LB in the next post.