A Day in the Life of a Packet in AKS (part 2): kubenet and ingress controller

Hey again! To complete the previous post on the Azure CNI, here is the same exercise using kubenet instead. To make it a bit more interesting, we are going to explore a bunch of additional stuff:

  • Deploying AKS with kubenet in your own vnet (note that this is not well documented or supported by Microsoft at the time of this writing, but it is nevertheless interesting!)
  • Ingress controller packet walk

Other posts in this series:

  • Part 1: deep dive in AKS with Azure CNI in your own vnet
  • Part 2 (this post): deep dive in AKS with kubenet in your own vnet, and ingress controllers
  • Part 3: outbound connectivity from AKS pods
  • Part 4: NSGs with Azure CNI cluster
  • Part 5: Virtual Node
  • Part 6: Network Policy with Azure CNI

 

Deploying the cluster

Assuming you already have an existing vnet (from Part 1), you just need to create a new subnet and deploy the new cluster with the kubenet plugin. Again, note that the only officially supported option to deploy AKS in your own vnet is the Azure CNI plugin, but we want to see how this thing works. Additionally, the kubenet setup is interesting for deploying AKS clusters in subnets without a lot of address space, since the pods do not consume addresses from the vnet prefix:

rg=akstest
aksname=aksPacketWalkKubenet
vnet=aksVnet
subnet=kubenet
az network vnet subnet create -g $rg -n $subnet --vnet-name $vnet --address-prefix 10.13.77.0/24
subnetid=$(az network vnet subnet show -g $rg --vnet-name $vnet -n $subnet --query id -o tsv)
az aks create -g $rg -n $aksname -c 2 --generate-ssh-keys -s Standard_B2ms -k 1.11.5 --network-plugin kubenet --vnet-subnet-id $subnetid
az aks get-credentials -g $rg -n $aksname

Note that we created the cluster with two nodes, since we are going to test later how inter-node communication works.

If you already had a jump host from the previous post, you can reuse it. You will have to upload your public SSH key to both k8s nodes:

noderg=$(az aks show -g $rg -n $aksname --query nodeResourceGroup -o tsv)
az vm user update -g $noderg -n <node_name> --username <admin_username> --ssh-key-value ~/.ssh/id_rsa.pub
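
If you do not want to look up the node names one by one, a small loop should do it. This is just a sketch; it assumes the default AKS admin username azureuser and that your public key lives in ~/.ssh/id_rsa.pub:

# Push the SSH public key to every node VM in the node resource group
for nodename in $(az vm list -g $noderg --query [].name -o tsv)
do
  az vm user update -g $noderg -n $nodename --username azureuser --ssh-key-value ~/.ssh/id_rsa.pub
done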

Verify cluster creation, create example app

Check that your credentials are working, and while doing that have a look at the IP addresses of your pods:

$ k get node -o wide
NAME                       STATUS    ROLES     AGE       VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-31351229-0   Ready     agent     23h       v1.11.5   10.13.77.4    <none>        Ubuntu 16.04.5 LTS   4.15.0-1035-azure   docker://3.0.1
aks-nodepool1-31351229-1   Ready     agent     6m        v1.11.5   10.13.77.5    <none>        Ubuntu 16.04.5 LTS   4.15.0-1035-azure   docker://3.0.1

Now we are ready to deploy an example app. I use this file, which I named whereami.yaml:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: whereami
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: whereami
    spec:
      containers:
      - name: whereami
        image: erjosito/whereami:1.3
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: whereami
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: whereami

You can deploy with this command:

k apply -f ./whereami.yaml

After a while, you will see the service with its external IP address populated (initially it will show as Pending):

$ k get svc
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP      10.0.0.1       <none>         443/TCP        22h
whereami     LoadBalancer   10.0.223.248   23.101.73.47   80:31808/TCP   3m
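
At this point you can already test the application from outside the cluster, using the EXTERNAL-IP shown above (yours will be different):

curl http://23.101.73.47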

You can see the IP address of the two endpoints (pods) associated to the service:

$ k get ep whereami
NAME       ENDPOINTS                       AGE
whereami   10.244.0.10:80,10.244.1.2:80   34m

Check that they are in different nodes. Otherwise, delete one of the pods, so that Kubernetes recreates it, until they are in different nodes:

$ k get pod -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE                       NOMINATED NODE
whereami-564765b89-g4xq5   1/1       Running   0          1m        10.244.1.2    aks-nodepool1-31351229-1   <none>
whereami-564765b89-mx45h   1/1       Running   0          34m       10.244.0.10   aks-nodepool1-31351229-0   <none>
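
If both replicas happen to land on the same node, you can delete one of them and the Deployment will recreate it, hopefully on the other node (repeat until they are spread out; the pod name below is just the one from my output):

k delete pod whereami-564765b89-g4xq5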

Azure Load Balancer

By now you know the drill. Let us look at the load balancer in Azure:

$ lb=$(az network lb list -g $noderg -o tsv --query [0].name)
$ az network lb rule list -g $noderg --lb-name $lb -o table
BackendPort    EnableFloatingIp    EnableTcpReset    FrontendPort    IdleTimeoutInMinutes    LoadDistribution    Name                                     Protocol    ProvisioningState    ResourceGroup
-------------  ------------------  ----------------  --------------  ----------------------  ------------------  ---------------------------------------  ----------  -------------------  ------------------------------------------
80             True                False             80              4                       Default             af6ce64281dbb11e9a6ba269b1ccf60c-TCP-80  Tcp         Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope

As expected, one rule with port 80, and Direct Server Return (aka Floating IP) enabled. See the Azure CNI blog for additional details here. Let’s have a look at the probe:

$ az network lb probe list -g $noderg --lb-name $lb -o table
IntervalInSeconds    Name                                     NumberOfProbes    Port    Protocol    ProvisioningState    ResourceGroup
-------------------  ---------------------------------------  ----------------  ------  ----------  -------------------  ------------------------------------------
5                    af6ce64281dbb11e9a6ba269b1ccf60c-TCP-80  2                 31808   Tcp         Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope

As with the Azure CNI plugin, the probes here are using the NodePort endpoint of the service.
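
If you want to double-check that the probe port matches the NodePort of the service, kubectl can show it directly (a quick jsonpath query; it should print 31808 in this example):

k get svc whereami -o jsonpath='{.spec.ports[0].nodePort}'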

eth and veth interfaces

Let us jump now to one of the nodes, and see whether there are any differences as compared to the Azure CNI plugin. You will need to go through your jump host and connect to one of the two nodes. I have selected the first one, but as long as you have pods on both, it does not really matter.

Looking at the node interfaces, you might notice the first difference: the IP address is assigned to eth0, not to azure0 as in the CNI plugin:

jose@aks-nodepool1-31351229-0:~$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:0d:3a:29:9c:bb
          inet addr:10.13.77.4  Bcast:10.13.77.255  Mask:255.255.255.0
          inet6 addr: fe80::20d:3aff:fe29:9cbb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3020483 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2344952 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2071123015 (2.0 GB)  TX bytes:375223625 (375.2 MB)

The bridge interface, which is called here cbr0, has a different IP address, to connect to the subnet where the pods are deployed:

jose@aks-nodepool1-31351229-0:~$ ifconfig cbr0
cbr0      Link encap:Ethernet  HWaddr aa:20:0e:58:7a:52
          inet addr:10.244.0.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::a820:eff:fe58:7a52/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:1845245 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1755882 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:243898101 (243.8 MB)  TX bytes:1019560630 (1.0 GB)

With the Azure CNI, nodes and pods share the same IP address space. With kubenet, pods live in a completely separate network. The only way for nodes to get into that pod network is through the cbr0 bridge.

In order to inspect this bridge, we need to install the brctl command:

jose@aks-nodepool1-31351229-0:~$ sudo apt install -y bridge-utils

...

jose@aks-nodepool1-31351229-0:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
cbr0            8000.aa200e587a52       no              veth02cc98a5
                                                        veth14f0efed
                                                        veth30950c3b
                                                        veth59d70255
                                                        veth6313441d
                                                        veth7719b83f
                                                        veth915270d7
                                                        vethccccd4ca
docker0         8000.02424dfdc1f8       no

Nothing too interesting here: a bunch of veth interfaces (each of which is piped to the interface of a container), plus the well-known docker0 bridge. Let’s try to find out which of these interfaces is linked to the pods of the application we deployed. We will use different methods that should yield the same result, so you can pick your favorite. Let’s start with the same approach we used in the Azure CNI post. First, we need to find out the container ID of one of our pods:

jose@aks-nodepool1-31351229-0:~$ sudo docker ps | grep whereami
cbaf4a69ef18        c8e4ff7df026                 "/bin/sh -c '/usr/sb…"   36 minutes ago      Up 36 minutes                           k8s_whereami_whereami-564765b89-mx45h_default_f6b320b5-1dbb-11e9-a6ba-269b1ccf60c7_0
fe5ef71557aa        k8s.gcr.io/pause-amd64:3.1   "/pause"                 36 minutes ago      Up 36 minutes                           k8s_POD_whereami-564765b89-mx45h_default_f6b320b5-1dbb-11e9-a6ba-269b1ccf60c7_0

With that container ID, we now go for the Process ID (PID) associated to it:

jose@aks-nodepool1-31351229-0:~$ sudo docker inspect --format '{{ .State.Pid }}' cbaf4a69ef18
3138

With the process ID we can now go into the network namespace of the container:

jose@aks-nodepool1-31351229-0:~$ sudo nsenter -t 3138 -n ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ca:38:e3:c1:9f:49 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.0.10/24 scope global eth0
       valid_lft forever preferred_lft forever

Several pieces of information here: the container’s MAC address, its IP address, and the index of the node interface it is linked to, number 13 in this case. Let’s find it:

jose@aks-nodepool1-31351229-0:~$ ip a | grep 13:
13: veth6313441d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cbr0 state UP group default

Let us verify that veth6313441d is indeed the interface connected to our container, by checking that the container’s MAC address is visible through it. First we need to get the port number of that interface, for which we will use the showstp command of brctl:

jose@aks-nodepool1-31351229-0:~$ sudo brctl showstp cbr0 | grep veth6313441d
veth6313441d (7)

Now we can have a look at the MAC addresses learnt from that port:

jose@aks-nodepool1-31351229-0:~$ sudo brctl showmacs cbr0 | grep -E "\s+7\s+"
  7     1e:8d:fc:38:95:6e       yes                0.00
  7     1e:8d:fc:38:95:6e       yes                0.00
  7     ca:38:e3:c1:9f:49       no                 2.01

And sure enough, the third MAC entry is the one we are looking for. Just for fun, there is another method to find the associated interface. I have put it in a script, in case you want to try it yourself:

#!/bin/bash
# getveth.sh: print the host-side veth interface associated with a container
if [ -z "${1+x}" ]
then
  echo "Use: $0 <container ID>"
else
  # Index of the host-side peer of the container's eth0
  linkid=$(sudo docker exec -it $1 bash -c 'cat /sys/class/net/eth0/iflink')
  linkid=${linkid//[!0-9]/}   # strip any non-numeric characters
  # Find the veth interface on the host whose ifindex matches
  indexpath=$(/bin/grep -l $linkid /sys/class/net/veth*/ifindex)
  # The interface name is the field at index 4 of /sys/class/net/<ifname>/ifindex
  IFS='/' read -r -a array <<< "$indexpath"
  ifname=${array[4]}
  echo $ifname
fi

For example, if you store the script above with the name getveth.sh and make it executable (chmod +x getveth.sh), you can use it like this:

jose@aks-nodepool1-31351229-0:~$ ./getveth.sh cbaf4a69ef18
veth6313441d

Routing

One of the benefits of knowing the PID of a container is that we can run commands without having to exec into it:

jose@aks-nodepool1-31351229-0:~$ sudo nsenter -t 3138 -n netstat -rnv
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.244.0.1      0.0.0.0         UG        0 0          0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U         0 0          0 eth0

As you can see, the pods’ default route points to the cbr0 bridge; that is how they reach the outside world. But how does the host know which containers are behind the cbr0 bridge? Easy, another route:

jose@aks-nodepool1-31351229-0:~$ netstat -rnv
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.13.77.1      0.0.0.0         UG        0 0          0 eth0
10.13.77.0      0.0.0.0         255.255.255.0   U         0 0          0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U         0 0          0 cbr0
168.63.129.16   10.13.77.1      255.255.255.255 UGH       0 0          0 eth0
169.254.169.254 10.13.77.1      255.255.255.255 UGH       0 0          0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U         0 0          0 docker0

Note that the route refers to a /24 subnet. Each node has its own /24 subnet for the pods it hosts. You can find out which subnet is assigned to each node with kubectl:

$ k describe node/aks-nodepool1-31351229-0 | grep CIDR
PodCIDR:                     10.244.0.0/24
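
Or, if you prefer to see all nodes at once, a quick jsonpath query does the trick:

k get node -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'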

It is good that the node knows how to reach the pods, but will Azure know? How does Azure know which subnet is allocated to each host? Part of the AKS magic consists in defining a route-table with the corresponding subnets pointing to each host’s private IP address. Let’s check whether there is any route table in the node resource group:

$ az network route-table list -g $noderg -o table
DisableBgpRoutePropagation    Location    Name                               ProvisioningState    ResourceGroup
----------------------------  ----------  ---------------------------------  -------------------  ------------------------------------------
False                         westeurope  aks-agentpool-31351229-routetable  Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope

Now that we know its name, let’s see what routes it contains:

$ rtname=$(az network route-table list -g $noderg --query [0].name -o tsv)
$ az network route-table route list -g $noderg --route-table-name $rtname -o table
AddressPrefix    Name                      NextHopIpAddress    NextHopType       ProvisioningState    ResourceGroup
---------------  ------------------------  ------------------  ----------------  -------------------  ------------------------------------------
10.244.0.0/24    aks-nodepool1-31351229-0  10.13.77.4          VirtualAppliance  Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope
10.244.1.0/24    aks-nodepool1-31351229-1  10.13.77.5          VirtualAppliance  Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope

One per host, as expected. This route table needs to be applied to the subnet where the AKS nodes are. Let us have a look at all the subnets in our vnet, and whether they have any route table attached:

$ az network vnet subnet list -g $rg --vnet-name $vnet --query [].[name,routeTable] -o tsv
aks     None
kubenet None
vms     None

The route table is not applied to the subnet! That is because we are running here a sort of unsupported scenario: deploying kubenet-based AKS on an existing Vnet. However, there is an easy fix for this. First, let’s verify that pods in different nodes cannot talk to each other, since Azure network does not know how to route the packets:

jose@aks-nodepool1-31351229-0:~$ sudo nsenter -t 3138 -n ping 10.244.1.2
PING 10.244.1.2 (10.244.1.2) 56(84) bytes of data.

You will not see any ICMP reply (as long as the pods are running in different nodes). If you just leave this ping running, you can use another terminal window to attach the route table to our subnet:

subnet=kubenet
rtid=$(az network route-table list -g $noderg --query [0].id -o tsv)
az network vnet subnet update -g $rg --vnet-name $vnet -n $subnet --route-table $rtid

After around 20 seconds, ping between the containers should work, since Azure now knows to which node each packet should be forwarded. The reason this is needed is that the pod addresses are visible to the Azure network, as opposed to other CNI plugins that hide the pod addresses behind the node addresses through some kind of encapsulation (usually VXLAN).
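
You can confirm that the association worked by querying the subnet again; the routeTable attribute should now point to the AKS route table:

az network vnet subnet show -g $rg --vnet-name $vnet -n $subnet --query routeTable.id -o tsv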

Since Azure seems to need to know how to reach the pod addresses, it does not look like there is any encapsulation at play here. Let us verify that by capturing the ping traffic. Ping one pod from the other, and run a network capture on one of the nodes’ eth0 interfaces (these are the packets as they come from and go to the network):

jose@aks-nodepool1-31351229-0:~$ sudo nsenter -t 3138 -n ping 10.244.1.2
PING 10.244.1.2 (10.244.1.2) 56(84) bytes of data.
64 bytes from 10.244.1.2: icmp_seq=1 ttl=62 time=1.00 ms
64 bytes from 10.244.1.2: icmp_seq=2 ttl=62 time=0.901 ms
64 bytes from 10.244.1.2: icmp_seq=3 ttl=62 time=1.07 ms
...

Our ping is working. Leave it running and start a packet capture in another terminal:

jose@aks-nodepool1-31351229-1:~$ sudo tcpdump -n -i eth0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:32:32.249757 IP 10.244.0.10 > 10.244.1.2: ICMP echo request, id 31592, seq 1, length 64
21:32:32.249866 IP 10.244.1.2 > 10.244.0.10: ICMP echo reply, id 31592, seq 1, length 64

Verified, kubenet is not using any network encapsulation between the nodes.

Traffic to the application

Now that we have looked at inter-node communication, let us turn our attention back to the traffic entering/leaving the cluster. As we did with the Azure CNI, let’s have a look at iptables in one of the nodes, where the magic happens. You can see the relevant entries with this command:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep whereami

I will break down the output of the previous command into multiple sections, for easier understanding. First there are some rule chains that match on traffic addressed to the NodePort endpoint of our Kubernetes service, such as the probes coming from the Azure Load Balancer:

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/whereami:" -m tcp --dport 31808 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/whereami:" -m tcp --dport 31808 -j KUBE-SVC-7G2JV7LNOR6DDNIY

Then there are two rules that cover the ClusterIP of the service, still not relevant for us:

-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.223.248/32 -p tcp -m comment --comment "default/whereami: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.223.248/32 -p tcp -m comment --comment "default/whereami: cluster IP" -m tcp --dport 80 -j KUBE-SVC-7G2JV7LNOR6DDNIY

When we hit the service from outside the cluster, the load balancer will send the traffic with the public IP address (remember that our rule is configured with Floating IP / Direct Server Return enabled). This is the rule we will hit:

-A KUBE-SERVICES -d 23.101.73.47/32 -p tcp -m comment --comment "default/whereami: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-7G2JV7LNOR6DDNIY

As you can see, the previous rule is jumping (-j) to the FW rules:

-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-SVC-7G2JV7LNOR6DDNIY
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-DROP

The first of the previous rules marks the packet for SNAT (do not be surprised when you find out later that the pod does not see our real IP), the second sends it to the SVC rules, and the third would drop the packet in case the SVC rules do not match:

-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-KFF4FJCPYGQEXMDG
-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -j KUBE-SEP-2QV3R2JVLUZRDSSA

The previous rules jump to the Service End Point (SEP) rules, which DNAT the traffic going to each pod (rules 2 and 4 in the following output), and mark for SNAT traffic sourced from the pods themselves (rules 1 and 3), the hairpin case where a pod reaches its own service:

-A KUBE-SEP-2QV3R2JVLUZRDSSA -s 10.244.1.2/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-2QV3R2JVLUZRDSSA -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.244.1.2:80
-A KUBE-SEP-KFF4FJCPYGQEXMDG -s 10.244.0.10/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-KFF4FJCPYGQEXMDG -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.244.0.10:80

We mentioned above something about “marking for NAT”. This is the rule that sets the mark, a 4-byte value attached to the packet as it traverses the different software layers of the Linux kernel (search for Netfilter packet marks if you need more info about this). For example, this rule uses the bit 0x8000 (0x8 is 1000 in binary) to mark a packet for later dropping:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep '\-A KUBE-MARK-DROP'
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000

And here we use another bit, 0x4000 (0x4 is 0100 in binary), to mark a packet for later masquerading:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep '\-A KUBE-MARK-MASQ'
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

Lastly, the KUBE-POSTROUTING chain does not match on any packet attribute, but on the mark:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING ! -d 10.0.0.0/8 -m iprange ! --dst-range 168.63.129.16-168.63.129.16 -m addrtype ! --dst-type LOCAL -j MASQUERADE
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

If you visit the webpage from your browser or with curl, you should be able to see a connection with the conntrack tool:

jose@aks-nodepool1-31351229-0:~$ sudo conntrack -L -d 23.101.73.47
tcp 6 86397 ESTABLISHED src=109.125.120.58 dst=23.101.73.47 sport=62585 dport=80 src=10.244.0.10 dst=10.244.0.1 sport=80 dport=62585 [ASSURED] mark=0 use=1
conntrack v1.4.3 (conntrack-tools): 1 flow entries have been shown.

The previous output shows that the ingress packet comes with source 109.125.120.58 (the public address of the client) and destination 23.101.73.47 (the VIP at the LB). The expected return packet from the pod has source 10.244.0.10 (the pod’s private IP) and destination 10.244.0.1 (the node’s cbr0 address): as the iptables rules anticipated, the client address was SNATted by the node, so the pod does not see our real IP.
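
If you prefer to watch the translation live instead of listing the table after the fact, conntrack can also stream connection events. Run something like this on the node before issuing the curl (same destination filter as before):

sudo conntrack -E -d 23.101.73.47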

HTTP Ingress Controller

We will not repeat the exercise of configuring externalTrafficPolicy in the service to prevent SNAT, so that the pod sees the real client IP address (if you want to see that, have a look at my previous post about the Azure CNI). Let’s instead look at something more interesting: an ingress controller.

In AKS you can install an nginx-based ingress controller to your cluster with a single command:

az aks enable-addons -a http_application_routing -g $rg -n $aksname

After a while, if you look at the services in the kube-system namespace you will see a couple of interesting things:

$ k -n kube-system get svc
NAME                                                  TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
addon-http-application-routing-default-http-backend   ClusterIP      10.0.64.107    <none>          80/TCP                       8m
addon-http-application-routing-nginx-ingress          LoadBalancer   10.0.183.181   51.136.49.205   80:30321/TCP,443:30286/TCP   8m
heapster                                              ClusterIP      10.0.234.0     <none>          80/TCP                       1d
kube-dns                                              ClusterIP      10.0.0.10      <none>          53/UDP,53/TCP                1d
kubernetes-dashboard                                  ClusterIP      10.0.1.125     <none>          80/TCP                       1d
metrics-server                                        ClusterIP      10.0.37.165    <none>          443/TCP                      1d

Other than the standard services required for the operation of the AKS cluster, you have two new ones:

  • A service for the nginx-ingress of type LoadBalancer, since it needs to be reachable from the outside world
  • A service for a default HTTP backend. This is the page that the ingress controller will serve when it does not know the requested resource (see the quick test after this list)
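
A quick way to see that default backend in action is to send a request with a hostname the ingress controller does not know about. This is just a sketch, using the external IP from the output above and a made-up Host header; you should get the default backend’s 404 response:

curl -H "Host: doesnotexist.example.com" http://51.136.49.205/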

Since the ingress service is of type LoadBalancer, something should have happened in the Azure Load Balancer:

$ az network lb frontend-ip list -g $noderg --lb-name $lb -o table
Name                              PrivateIpAllocationMethod    ProvisioningState    ResourceGroup
--------------------------------  ---------------------------  -------------------  ------------------------------------------
af6ce64281dbb11e9a6ba269b1ccf60c  Dynamic                      Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope
a4c1a6c4c1e1d11e99cfe8afbff07ff3  Dynamic                      Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope

We see indeed a new frontend configuration, let’s confirm that the associated public IP matches what we saw configured in the Kubernetes service for nginx:

$ pipid=$(az network lb frontend-ip show -g $noderg --lb-name $lb -n a4c1a6c4c1e1d11e99cfe8afbff07ff3 --query publicIpAddress.id -o tsv)
$ az network public-ip show --id $pipid --query ipAddress -o tsv
51.136.49.205

Perfect! Now that we are sure this is the right frontend configuration, we can have a look at the additional load balancing rules. They will have a name similar to the frontend IP configuration:

$ az network lb rule list -g $noderg --lb-name $lb -o table
BackendPort    EnableFloatingIp    EnableTcpReset    FrontendPort    IdleTimeoutInMinutes    LoadDistribution    Name                                      Protocol    ProvisioningState    ResourceGroup
-------------  ------------------  ----------------  --------------  ----------------------  ------------------  ----------------------------------------  ----------  -------------------  ------------------------------------------
80             True                False             80              4                       Default             af6ce64281dbb11e9a6ba269b1ccf60c-TCP-80   Tcp         Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope
80             True                False             80              4                       Default             a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-80   Tcp         Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope
443            True                False             443             4                       Default             a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-443  Tcp         Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope

Two new rules, one for port 80, and another one for port 443. This is a hint that ingress controllers only work for HTTP(S) traffic. Let’s have a look at the probes:

$ az network lb probe list -g $noderg --lb-name $lb -o table
IntervalInSeconds    Name                                      NumberOfProbes    Port    Protocol    ProvisioningState    ResourceGroup                               RequestPath
-------------------  ----------------------------------------  ----------------  ------  ----------  -------------------  ------------------------------------------  -------------
5                    af6ce64281dbb11e9a6ba269b1ccf60c-TCP-80   2                 31808   Tcp         Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope
5                    a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-80   2                 31924   Http        Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope  /healthz
5                    a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-443  2                 31924   Http        Succeeded            MC_akstest_aksPacketWalkKubenet_westeurope  /healthz

As usual, the probe goes to the NodePort. There is something interesting too: the probe is of type HTTP, and goes to the /healthz path. As we saw in the Azure CNI blog post, this is a hint that externalTrafficPolicy is set to Local, so that traffic will hit the nginx controllers without being NATted.
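
You can confirm this directly on the service object; if the hint above is right, the following should print Local:

k -n kube-system get svc addon-http-application-routing-nginx-ingress -o jsonpath='{.spec.externalTrafficPolicy}'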

You can save this configuration under the name whereami-ingress.yaml:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: whereami-ingress
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: whereami-ingress
    spec:
      containers:
      - name: whereami-ingress
        image: erjosito/whereami:1.3
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: whereami-ingress
spec:
  type: ClusterIP
  ports:
  - port: 80
  selector:
    app: whereami-ingress
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: whereami-ingress
  annotations:
    kubernetes.io/ingress.class: addon-http-application-routing
spec:
  rules:
  - host: whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io
    http:
      paths:
      - backend:
          serviceName: whereami-ingress
          servicePort: 80
        path: /

As you can see, the config is pretty similar to the previous ones, with two major differences:

  • The service is now of type ClusterIP, since it does not need to be reachable from outside the cluster
  • This ClusterIP service will be accessed from the ingress controller (our nginx pods). The kubernetes ingress object tells the ingress which URLs are to be mapped to the ClusterIP service.

If you are wondering how I came up with that hostname in the ingress spec, you need to use the DNS zone that was created along with the ingress controller. You can verify it in two ways. You can look at the HTTPApplicationRoutingZoneName attribute of the cluster:

zonename=$(az aks show -g $rg -n $aksname --query addonProfiles.httpApplicationRouting.config.HTTPApplicationRoutingZoneName -o tsv)

Or you can have a look at the new DNS zone created in the node resource group:

zonename=$(az network dns zone list -g $noderg --query [0].name -o tsv)

Now you can deploy this to the cluster:

k apply -f ./whereami-ingress.yaml

Now we can see the new ingress that has been created. An ingress is a way of telling our nginx frontend how to reach our service, based either on the hostname passed in the HTTP Host header, or on the path of the URL.

$ k get ingress
NAME               HOSTS                                                        ADDRESS         PORTS     AGE
whereami-ingress   whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io   51.136.49.205   80        11m

In order for this to work, the new hostname needs to be resolvable on the public Internet. That is what the DNS zone created in the node resource group is for. If we look for A records there, we will find what we are looking for:

$ az network dns record-set a list -g $noderg -z $zonename --query [].[name,arecords[0].ipv4Address] -o tsv
whereami-ingress        51.136.49.205

Which is the public IP address assigned to the LoadBalancer service for our ingress controller. After a while (DNS needs some seconds or minutes to propagate over the Internet), nslookup will resolve the new name and you can browse to it (use the name, not the IP!). The generated page contains some information that is interesting for our packet walk:

$ curl -s whereami-ingress.$zonename | grep -E 'Remote address|X-Forwarded-For|Private IP address'
         <li>Private IP address: 10.244.0.13</li>
         <li>Remote address: 10.244.1.6</li>
         <li>X-Forwarded-For HTTP header: 109.125.120.58</li>

Let’s see what this means:

  • The pod where we ended up has the private IP of 10.244.0.13
  • It is seeing the packets with a source IP of 10.244.1.6, which corresponds to our nginx pod. This makes sense: nginx proxies the connection, so the backend sees nginx’s address and return traffic goes back through it.
  • Lastly, nginx was kind enough to put the original client IP in the X-Forwarded-For HTTP header, so that we can use this information for application logging

Let us verify the private IP address of the nginx pod, 10.244.1.6:

$ k -n kube-system get pod -o wide
NAME                                                              READY     STATUS    RESTARTS   AGE       IP           NODE                       NOMINATED NODE
addon-http-application-routing-default-http-backend-5ccb95flhzj   1/1       Running   0          2h        10.244.1.7   aks-nodepool1-31351229-1   <none>
addon-http-application-routing-external-dns-5c8c885957-9l2jq      1/1       Running   0          2h        10.244.1.5   aks-nodepool1-31351229-1   <none>
addon-http-application-routing-nginx-ingress-controller-ffbbqj6   1/1       Running   0          2h        10.244.1.6   aks-nodepool1-31351229-1   <none>
heapster-5d6f9b846c-rhtbq                                         2/2       Running   0          2h        10.244.1.4   aks-nodepool1-31351229-1   <none>
kube-dns-v20-7c7d7d4c66-lqgb4                                     4/4       Running   0          1d        10.244.0.6   aks-nodepool1-31351229-0   <none>
kube-dns-v20-7c7d7d4c66-p2969                                     4/4       Running   0          1d        10.244.0.7   aks-nodepool1-31351229-0   <none>
kube-proxy-c4mxr                                                  1/1       Running   0          2h        10.13.77.5   aks-nodepool1-31351229-1   <none>
kube-proxy-djnfs                                                  1/1       Running   0          2h        10.13.77.4   aks-nodepool1-31351229-0   <none>
kube-svc-redirect-j4dhd                                           2/2       Running   0          1d        10.13.77.4   aks-nodepool1-31351229-0   <none>
kube-svc-redirect-njr7l                                           2/2       Running   0          13h       10.13.77.5   aks-nodepool1-31351229-1   <none>
kubernetes-dashboard-68f468887f-hzx6q                             1/1       Running   1          1d        10.244.0.4   aks-nodepool1-31351229-0   <none>
metrics-server-5cbc77f79f-w9psr                                   1/1       Running   1          1d        10.244.0.3   aks-nodepool1-31351229-0   <none>
tunnelfront-76d5496779-nkrqp                                      1/1       Running   0          5h        10.244.1.3   aks-nodepool1-31351229-1   <none>

Essentially we have two flows:

  1. Client web browser – nginx: this communication goes through an Azure External Load Balancer. As we will verify, no SNAT is performed, so that nginx gets the real IP address of the client (the first hint at this was the HTTP-based health probe in the Azure LB, remember?)
  2. nginx – application pod: this is intra-cluster pod-to-pod communication using internal IPs

Let’s analyze the iptables configuration for the nginx ingress controller pod, to verify the absence of NATting. I have reordered the output of the command sudo iptables-save | grep nginx for ease of reading, and selected the rules for port 80 (similar rules exist for 443 too).

First, the LoadBalancer service rule:

-A KUBE-SERVICES -d 51.136.49.205/32 -p tcp -m comment --comment "kube-system/addon-http-application-routing-nginx-ingress:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-PXYKE4WDT2UXBLRE

Which jumps to the FW rules:

-A KUBE-FW-PXYKE4WDT2UXBLRE -m comment --comment "kube-system/addon-http-application-routing-nginx-ingress:http loadbalancer IP" -j KUBE-XLB-PXYKE4WDT2UXBLRE
-A KUBE-FW-PXYKE4WDT2UXBLRE -m comment --comment "kube-system/addon-http-application-routing-nginx-ingress:http loadbalancer IP" -j KUBE-MARK-DROP

Note that packets are not marked for NAT aka masquerading. Then it jumps to:

-A KUBE-XLB-PXYKE4WDT2UXBLRE -m comment --comment "Balancing rule 0 for kube-system/addon-http-application-routing-nginx-ingress:http" -j KUBE-SEP-EVDIGAVJIJH6RX2S

No probabilities, since we only have one pod (having at least 2 would be recommended for production scenarios). Finally it jumps to the SEP rules:

-A KUBE-SEP-EVDIGAVJIJH6RX2S -s 10.244.1.6/32 -m comment --comment "kube-system/addon-http-application-routing-nginx-ingress:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-EVDIGAVJIJH6RX2S -p tcp -m comment --comment "kube-system/addon-http-application-routing-nginx-ingress:http" -m tcp -j DNAT --to-destination 10.244.1.6:80

For traffic going to the pod, the second entry DNATs the LB VIP to the pod’s address. The first entry marks traffic sourced from the pod itself for SNAT (the hairpin case, when the pod reaches its own service).

If we get into the nginx pod, we can look at the configuration that has been injected into it. You can find more info about nginx ingress troubleshooting here.

k -n kube-system exec -it addon-http-application-routing-nginx-ingress-controller-ffbbqj6 cat /etc/nginx/nginx.conf

I will put a fragment of the configuration here:

        upstream upstream_balancer {
                server 0.0.0.1; # placeholder
                balancer_by_lua_block {
                        balancer.balance()
                }
                keepalive 32;
        }
        ## start server whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io
        server {
                server_name whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io ;
                listen 80;
                set $proxy_upstream_name "-";
                location / {
                        set $namespace      "default";
                        set $ingress_name   "whereami-ingress";
                        set $service_name   "whereami-ingress";
                        set $service_port   "80";
                        set $location_path  "/";
                        ...
                        set $proxy_upstream_name "default-whereami-ingress-80";
                        ...
                        proxy_set_header X-Forwarded-For        $the_real_ip;
                        ...
                        proxy_http_version                      1.1;
                        proxy_pass http://upstream_balancer;
                }
        }
        ## end server whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io

There are two relevant parts to this:

  • The server: this is where nginx configures some parameters for the web service, such as the rule to inject the X-Forwarded-For header we saw before. It defines a proxy_pass to an upstream load balancer, which is what decides to which servers requests should be forwarded
  • The first part of the previous snippet contains the definition of that upstream load balancer, aptly named upstream_balancer. This is where the addresses of the application pods should be listed, but instead we get a call to something named balancer_by_lua_block. What is this?

The frequent coming and going of pods in a Kubernetes cluster would force very frequent updates of the nginx config file. To prevent this, the upstream server addresses (backend servers in nginx jargon) are not defined in the config file, but dynamically obtained by a Lua script. Lua is a scripting language with tight integration into nginx.

You can actually have a look at the directory where the Lua balancer scripts are stored:

$ k -n kube-system exec -it addon-http-application-routing-nginx-ingress-controller-ffbbqj6 ls /etc/nginx/lua/balancer
chash.lua  ewma.lua  resty.lua  round_robin.lua  sticky.lua

Now let us see the endpoints of our application, whereami-ingress:

$ k get ep whereami-ingress
NAME               ENDPOINTS                      AGE
whereami-ingress   10.244.0.13:80,10.244.1.9:80   1h

If you look at the logs of the ingress pod, you should see how traffic is redirected to both IP addresses:

$ k -n kube-system logs pod/addon-http-application-routing-nginx-ingress-controller-ffbbqj6
...
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:17:23 +0000] "GET / HTTP/1.1" 200 1740 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" 503 0.376 [default-whereami-ingress-80] 10.244.0.13:80 4666 0.376 2007db4e626c2149b6ac4213ba5335485cc
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:17:23 +0000] "GET /styles.css HTTP/1.1" 200 1052 "http://whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" 467 0.001 [default-whereami-ingress-80] 10.244.1.9:80 3310 0.000 200 9c33601d4506a3717945417b0517ad06
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:17:23 +0000] "GET /favicon.ico HTTP/1.1" 200 1150 "http://whereami-ingress.bf52096edbd84ca1b3b6.westeurope.aksapp.io/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" 532 0.003 [default-whereami-ingress-80] 10.244.0.13:80 1150 0.004 200 5394d5ee2ea6822aacec60704ae049c8
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:26:10 +0000] "GET / HTTP/1.1" 200 4665 "-" "curl/7.58.0" 122 0.581 [default-whereami-ingress-80] 10.244.1.9:80 4665 0.576 200 acb473c5f0c792b53fd7c936478cd6ec
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:26:23 +0000] "GET / HTTP/1.1" 200 4666 "-" "curl/7.58.0" 122 0.319 [default-whereami-ingress-80] 10.244.0.13:80 4666 0.316 200 e86645fb39b6cf9ba6127386ead7f7f1
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:26:51 +0000] "GET / HTTP/1.1" 200 4665 "-" "curl/7.58.0" 122 0.321 [default-whereami-ingress-80] 10.244.1.9:80 4665 0.320 200 83512f96fded353d3f3de0da1b0fd17d
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:27:41 +0000] "GET / HTTP/1.1" 200 4666 "-" "curl/7.58.0" 122 0.296 [default-whereami-ingress-80] 10.244.0.13:80 4666 0.296 200 15c729ed77f5f4a385e1fe9ea02426d4
109.125.120.58 - [109.125.120.58] - - [22/Jan/2019:10:33:24 +0000] "GET / HTTP/1.1" 200 4666 "-" "curl/7.58.0" 122 0.380 [default-whereami-ingress-80] 10.244.0.13:80 4666 0.380 200 9207dfed715b6adab008540b9e36a070

We can look at the connection tracking table to see the two sessions (client-nginx and nginx-web). Let’s begin with the first one, and use the client’s public IP address for a change (so far we have been always using the destination IP):

jose@aks-nodepool1-31351229-1:~$ sudo conntrack -L -s 109.125.120.58
tcp 6 86386 ESTABLISHED src=109.125.120.58 dst=51.136.49.205 sport=63774 dport=80 src=10.244.1.6 dst=109.125.120.58 sport=80 dport=63774 [ASSURED] mark=0 use=1

As you can see, the source IP address 109.125.120.58 is not NATted, since the return traffic from the nginx pod is addressed to that very same address. By the way, if you filter by the nginx pod’s IP as destination instead, you will also see other connections, such as the health checks coming from the node (10.244.1.1 is the node’s cbr0 address, and 10254 is the ingress controller’s health port):

jose@aks-nodepool1-31351229-1:~$ sudo conntrack -L -d 10.244.1.6
...
tcp 6 1 TIME_WAIT src=10.244.1.1 dst=10.244.1.6 sport=40606 dport=10254 src=10.244.1.6 dst=10.244.1.1 sport=10254 dport=40606 [ASSURED] mark=0 use=1
...

If we have a look at the second leg of our connectivity, we see that there is no NAT involved, since we are seeing direct pod-to-pod communication:

jose@aks-nodepool1-31351229-1:~$ sudo conntrack -L -d 10.244.0.13
tcp 6 97 TIME_WAIT src=10.244.1.6 dst=10.244.0.13 sport=55710 dport=80 src=10.244.0.13 dst=10.244.1.6 sport=80 dport=55710 [ASSURED] mark=0 use=1

It is interesting to see as well that nginx does not use the ClusterIP for reaching the application pods, but their individual endpoint IP addresses. We investigated pod-to-pod communication earlier in this blog.

Load Balancing algorithms

As we have seen, there are multiple places in the architecture where some kind of load balancing takes place when using an ingress controller:

  1. The Azure Load Balancer will select one of the nodes containing nginx pods
  2. The selected node will select one of the nginx pods. Typically there would not be any load balancing here, since you should distribute the ingress controllers over different nodes in your network for better resiliency
  3. Finally, the nginx reverse-proxy would use its lua-based load balancer to select one of the pods containing the application

Which load balancing algorithms are used in each case?

The first one is the easiest: the algorithms for the Azure Load Balancer are well documented, and can be configured per load balancing rule:

$ az network lb rule list -g $noderg --lb-name $lb --query [].[name,loadDistribution] -o tsv
af6ce64281dbb11e9a6ba269b1ccf60c-TCP-80 Default
a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-80 Default
a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-443 Default

If you refer to the Azure LB documentation you can see that the default load distribution algorithm is a 5-tuple hash (source/destination address/port and protocol type).
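
For example, you could switch one of the rules to source-IP affinity (a 2-tuple hash). This is just a sketch using one of the rule names from the table above, and keep in mind that AKS may reconcile manual changes to the load balancer at any time:

az network lb rule update -g $noderg --lb-name $lb -n a4c1a6c4c1e1d11e99cfe8afbff07ff3-TCP-80 --load-distribution SourceIP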

The second one, which is typically more relevant when ingress controllers are not involved, is iptables-based: DNAT rules that are assigned a certain probability using the random mode (with the option, you guessed right, --mode random). This selection is random per connection, as opposed to the nth mode (--mode nth), which would implement a round-robin distribution.

Here you can see again the load balancing example via iptables. As you can see, the first rule has a 50% probability of being chosen. If that is not the case, processing proceeds to the second one. Since it has no probability configured, the second rule is chosen in 100% of the remaining cases (that is, the remaining 50% of the time):

-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-KFF4FJCPYGQEXMDG
-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -j KUBE-SEP-2QV3R2JVLUZRDSSA
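
For the sake of illustration, with three endpoints kube-proxy would generate probabilities of 1/3 and 1/2 for the first two rules and no match condition for the last one, so that each endpoint still gets roughly a third of the new connections. The chain names below are made up:

-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.33333333333 -j KUBE-SEP-AAAAAAAAAAAAAAAA
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-BBBBBBBBBBBBBBBB
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-CCCCCCCCCCCCCCCC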

The last one is a bit trickier, since it is not that clear in the nginx documentation (or at least I did not find it). Normally you would find the load balancing algorithm in the upstream definition, but since the Kubernetes implementation uses Lua scripts for distributing the load, there is not a ton of information there:

        upstream upstream_balancer {
                 server 0.0.0.1; # placeholder
                 balancer_by_lua_block {
                         balancer.balance()
                 }
                 keepalive 32;
         }

If no load balancing algorithm is specified, nginx defaults to round robin, but I am not sure whether that also applies to Lua-based upstreams. The closest document I found is this one, where it is stated that nginx as an ingress controller will use a round robin algorithm, unless an annotation for cookie affinity is set.
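
If you want to experiment with that cookie-based affinity, the upstream nginx ingress controller understands an affinity annotation on the ingress object. Below is a sketch; I have not verified that the version deployed by the http_application_routing addon honors it:

k annotate ingress whereami-ingress nginx.ingress.kubernetes.io/affinity=cookie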

And that is all I did. I hope you learnt something today!
