A Day in the Life of a Packet in Azure Kubernetes Service (part 1): Azure CNI

I have often found myself troubleshooting networking inside Azure Kubernetes Service (AKS), so, prompted by a colleague, I decided to do a deep dive into the way packets are forwarded. It turns out I learnt quite a lot! In this blog I will describe how to check every step of the way in an AKS cluster using the Azure CNI plugin (let’s see if I find the time to do the same with the kubenet plugin). It is going to go pretty deep, so fasten your seat belts!

  • Part 1 (this post): deep dive in AKS with Azure CNI in your own vnet
  • Part 2: deep dive in AKS with kubenet in your own vnet, and ingress controllers
  • Part 3: outbound connectivity from AKS pods
  • Part 4: NSGs with Azure CNI cluster
  • Part 5: Virtual Node
  • Part 6: Network Policy with Azure CNI

Getting Started

In order to test, I have created an AKS cluster in an existing Vnet with advanced networking (aka the Azure CNI plugin):

rg=akstest
vnet=aksVnet
subnet=aks
aksname=aksPacketWalkAzure
az group create -n $rg -l westeurope
az network vnet create -g $rg -n $vnet --address-prefix 10.13.0.0/16 --subnet-name $subnet --subnet-prefix 10.13.76.0/24
subnetid=$(az network vnet subnet show -g $rg --vnet-name $vnet -n $subnet --query id -o tsv)
az aks create -g $rg -n $aksname -c 1 --generate-ssh-keys -s Standard_B2ms -k 1.11.5 --network-plugin azure --vnet-subnet-id $subnetid
az aks get-credentials -g $rg -n $aksname

For troubleshooting we will need a VM that we will use as a jump host. Let us put it in a new subnet in the same Vnet.

subnet=vmsubnet
admin_password='$uper$ecretPassw0rd'
admin_user=lab-user
az network vnet subnet create -g $rg -n $subnet --vnet-name $vnet --address-prefix 10.13.1.0/24
az vm create --image UbuntuLTS -g $rg -n testvm --admin-password $admin_password --admin-username $admin_user --public-ip-address testvm-pip --vnet-name $vnet --subnet $subnet --os-disk-size-gb 30 --storage-sku Standard_LRS --no-wait
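Once the deployment completes, you can retrieve the jump host’s public IP address, for example by querying the public IP resource we named testvm-pip above:

az network public-ip show -g $rg -n testvm-pip --query ipAddress -o tsv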

When the VM is ready, you will be able to SSH to its public IP address. In order to connect to the AKS nodes, you will need a public/private SSH key pair. I typically use the one already on my laptop; you can add its public key to the Kubernetes nodes like this:

aksname=aksPacketWalkAzure
local_user=lab-user
noderg=$(az aks show -g $rg -n $aksname --query nodeResourceGroup -o tsv)
nodename=$(az vm list -g $noderg --query [0].name -o tsv)
az vm user update -g $noderg -n $nodename --username $local_user --ssh-key-value ~/.ssh/id_rsa.pub

Once your key is on the Kubernetes nodes, you can connect through your jump host, for example using the -J option of ssh.
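For example, assuming the variables defined above, the jump host’s public IP stored in a (made-up) variable called jump_ip, and the node’s private IP of 10.13.76.4 (we will confirm that address below), the command would look something like this:

# hop through the jump host to reach the node on its private IP
ssh -J $admin_user@$jump_ip $local_user@10.13.76.4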

Deploy an app

The first thing we need to do is to find out the resource group where the AKS infrastructure (such as the node VMs and the load balancers) is deployed:

aksname=aksPacketWalkAzure
noderg=$(az aks show -g akstest -n $aksname --query nodeResourceGroup -o tsv)

I have deployed a simple app consisting of a deployment with 2 pod replicas and a LoadBalancer service:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: whereami
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: whereami
    spec:
      containers:
      - name: whereami
        image: erjosito/whereami:1.3
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: whereami
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: whereami
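If you save the manifest above to a file (I am assuming the name whereami.yaml here), deploying it is just:

k apply -f whereami.yaml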

Let’s have a look at the items that have been created (I am using k as an alias for kubectl):

$ k get svc
NAME         TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)        AGE
kubernetes   ClusterIP      10.0.0.1      <none>           443/TCP        24m
whereami     LoadBalancer   10.0.159.58   51.144.176.251   80:30064/TCP   11m

$ k get ep whereami
NAME       ENDPOINTS                      AGE
whereami   10.13.76.26:80,10.13.76.7:80   37m

$ k get pod -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE                       NOMINATED NODE
whereami-564765b89-j7bpw   1/1       Running   0          46m       10.13.76.7    aks-nodepool1-31351229-0   <none>
whereami-564765b89-qfq2k   1/1       Running   0          46m       10.13.76.26   aks-nodepool1-31351229-0   <none>

Azure resources

As you can see, the service is of type LoadBalancer. That means that there should be an Azure Load Balancer in the node resource group. Let’s have a look at it:

lb=$(az network lb list -g $noderg -o tsv --query [0].name)
az network lb rule list -g $noderg --lb-name $lb -o table
BackendPort    EnableFloatingIp    EnableTcpReset    FrontendPort    IdleTimeoutInMinutes    LoadDistribution    Name                                     Protocol    ProvisioningState    ResourceGroup
-------------  ------------------  ----------------  --------------  ----------------------  ------------------  ---------------------------------------  ----------  -------------------  ----------------------------------------
80             True                False             80              4                       Default             a48b4c0e11cfd11e9981c7abfb60d882-TCP-80  Tcp         Succeeded            MC_akstest_aksPacketWalkAzure_westeurope

az network lb probe list -g $noderg --lb-name $lb -o table
IntervalInSeconds    Name                                     NumberOfProbes    Port    Protocol    ProvisioningState    ResourceGroup
-------------------  ---------------------------------------  ----------------  ------  ----------  -------------------  ----------------------------------------
5                    a48b4c0e11cfd11e9981c7abfb60d882-TCP-80  2                 30064   Tcp         Succeeded            MC_akstest_aksPacketWalkAzure_westeurope

As you can see, the probe is monitoring the service’s NodePort (30064), not port 80. The probes are configured for the fastest supported detection: they are sent every 5 seconds, and an endpoint will be flagged as down after 2 consecutive failures.

Something important to note is the EnableFloatingIP setting (also known as Direct Server Return). With it enabled, the load balancer does not replace the virtual IP address (the frontend address) with the backend instance’s IP, so packets arrive at the node still addressed to the VIP. This will be very relevant later.
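If you want to double-check that flag on the rule itself (using the rule name from the table above), this should return true:

az network lb rule show -g $noderg --lb-name $lb -n a48b4c0e11cfd11e9981c7abfb60d882-TCP-80 --query enableFloatingIp -o tsv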

Let’s verify the IP addresses (a single one in this case, since our cluster only has one node) in the backend pool:

az network lb address-pool list -g $noderg --lb-name $lb -o table
az network lb address-pool list -g $noderg --lb-name $lb --query [].backendIpConfigurations[].id -o tsv
/subscriptions/e7da9914-9b05-4891-893c-546cb7b0422e/resourceGroups/MC_akstest_aksPacketWalkAzure_westeurope/providers/Microsoft.Network/networkInterfaces/aks-nodepool1-31351229-nic-0/ipConfigurations/ipconfig1

az network nic ip-config show -g $noderg --nic-name aks-nodepool1-31351229-nic-0 -n ipconfig1 --query privateIpAddress -o tsv
10.13.76.4
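While we are at it, we can also check that the load balancer frontend really corresponds to the EXTERNAL-IP that kubectl reported (51.144.176.251). One way of doing it (the exact resource names will differ in your cluster) is listing the frontend configurations and the public IPs in the node resource group:

az network lb frontend-ip list -g $noderg --lb-name $lb --query [].publicIpAddress.id -o tsv
az network public-ip list -g $noderg --query "[].{name:name, address:ipAddress}" -o table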

iptables

Alright, we know how packets arrive at the node. There they will be handled by iptables, so let’s have a look at the configuration. At this point you need to go to your jump host and connect to the Kubernetes node:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep whereami
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/whereami:" -m tcp --dport 30064 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/whereami:" -m tcp --dport 30064 -j KUBE-SVC-7G2JV7LNOR6DDNIY
-A KUBE-SERVICES -d 51.144.176.251/32 -p tcp -m comment --comment "default/whereami: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-7G2JV7LNOR6DDNIY
-A KUBE-SERVICES ! -s 10.13.76.0/24 -d 10.0.159.58/32 -p tcp -m comment --comment "default/whereami: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.159.58/32 -p tcp -m comment --comment "default/whereami: cluster IP" -m tcp --dport 80 -j KUBE-SVC-7G2JV7LNOR6DDNIY
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-SVC-7G2JV7LNOR6DDNIY
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-6HBOEI5FVFTJNRJ3
-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -j KUBE-SEP-IJTMGMPNVALZGJZD
-A KUBE-SEP-6HBOEI5FVFTJNRJ3 -s 10.13.76.26/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-6HBOEI5FVFTJNRJ3 -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.13.76.26:80
-A KUBE-SEP-IJTMGMPNVALZGJZD -s 10.13.76.7/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-IJTMGMPNVALZGJZD -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.13.76.7:80

Alright, there is a bunch to process here. Notice the first KUBE-SERVICES rule, matching on a destination IP 51.144.176.251 and port 80:

-A KUBE-SERVICES -d 51.144.176.251/32 -p tcp -m comment --comment "default/whereami: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-7G2JV7LNOR6DDNIY

Remember we said we had Floating IP, aka Direct Server Return, configured in our Azure LB rule? As a consequence, we see the public VIP here as the destination. The target of this rule is the iptables chain KUBE-FW-7G2JV7LNOR6DDNIY, which contains three rules:

-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-SVC-7G2JV7LNOR6DDNIY
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-DROP

The first rule marks the packet for masquerading (the iptables name for source NAT). Marking a packet is a ‘non-terminating’ action in iptables, which means that the following rules in the chain are still processed.

The second one jumps to KUBE-SVC-7G2JV7LNOR6DDNIY. Note that there is a third rule that would mark the packet to be dropped, should the previous jump not end in a terminating rule. Let’s have a look at that KUBE-SVC target:

-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-6HBOEI5FVFTJNRJ3
-A KUBE-SVC-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami:" -j KUBE-SEP-IJTMGMPNVALZGJZD

These are the endpoint rules; you will see as many of them as there are endpoints in your service. Note the probability associated with the first rule: this is how iptables load balances the traffic (with two endpoints, the first rule matches 50% of the packets and the second rule catches the rest). Finally, let’s have a look at the first of those endpoint chains:

-A KUBE-SEP-6HBOEI5FVFTJNRJ3 -s 10.13.76.26/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-6HBOEI5FVFTJNRJ3 -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.13.76.26:80

The first rule marks traffic sourced from the pod itself for masquerading (this covers the hairpin case, where a pod talks to its own service and is load balanced back to itself); the second one is what actually redirects the traffic to the corresponding endpoint (pod), in this case the one with IP address 10.13.76.26.

Let us have a look at what this packet marking is about:

jose@aks-nodepool1-31351229-0:~$ sudo iptables -t nat -L KUBE-MARK-MASQ
Chain KUBE-MARK-MASQ (19 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK or 0x4000

jose@aks-nodepool1-31351229-0:~$ sudo iptables -L -t nat -v | grep -i masquerade
    0     0 MASQUERADE  all  --  any    !docker0  172.17.0.0/16        anywhere
 3260  203K MASQUERADE  all  --  any    any     anywhere            !10.0.0.0/8           destination IP range ! 168.63.129.16-168.63.129.16 ADDRTYPE match dst-type !LOCAL
    4   208 MASQUERADE  all  --  any    any     anywhere             anywhere             /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000

As you can see, the KUBE-MARK-MASQ chain sets a mark by ORing 0x4000 (a single bit) into the packet mark. The masquerading rule in the POSTROUTING path then matches on this mark to SNAT the traffic.
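If you are wondering where this mark is consumed, and where the KUBE-SERVICES chain we saw above is hooked into the packet path, you can list the relevant chains of the nat table:

# KUBE-SERVICES is jumped to from PREROUTING (and from OUTPUT for locally generated traffic)
sudo iptables -t nat -L PREROUTING -n -v
# KUBE-POSTROUTING is where packets carrying the 0x4000 mark finally get masqueraded
sudo iptables -t nat -L KUBE-POSTROUTING -n -v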

After opening a connection to the VIP from your browser, let us have a look at the connection tracking table:

jose@aks-nodepool1-31351229-0:~$ sudo conntrack -L -d 51.144.176.251
tcp      6 3577 CLOSE_WAIT src=109.125.120.58 dst=51.144.176.251 sport=51535 dport=80 src=10.13.76.7 dst=10.13.76.4 sport=80 dport=51535 [ASSURED] mark=0 use=1
tcp      6 3577 CLOSE_WAIT src=109.125.120.58 dst=51.144.176.251 sport=51536 dport=80 src=10.13.76.26 dst=10.13.76.4 sport=80 dport=51536 [ASSURED] mark=0 use=1
conntrack v1.4.3 (conntrack-tools): 2 flow entries have been shown.

109.125.120.58 is the public IP of my laptop, that is, the source IP of the packets. Let’s take the first entry and analyze what it is saying: each flow contains two packet descriptions, one for the packet as it enters the node (before anything is translated), and one for the answer from the pod:

  • Packets come in from outside with the public IP address of the client as source (109.125.120.58) and the load balancer’s public VIP as destination (51.144.176.251; remember our discussion about floating IP aka Direct Server Return).
  • The pod’s answer has the pod’s own IP address (10.13.76.7) as source and 10.13.76.4 as destination. This tells us that the original client’s IP (109.125.120.58) is not visible to the pod: it replies to the node’s address (10.13.76.4) instead. See right after this list for a simple way to watch these entries live.
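If you want to watch these conntrack entries appear in real time while you refresh the page in your browser, a simple way is:

# refresh the view every 2 seconds while you generate traffic to the VIP
watch -n 2 "sudo conntrack -L -d 51.144.176.251"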

Bridges, eth and veth

If you wanted to confirm that traffic is being NATted, you could run a traffic capture on the interface connected to the pod. But wait, how do we find that interface? First, you need to find out a couple of things: the ID of the Docker container (we will use the first one, since we have two pods), and the PID associated with it:

jose@aks-nodepool1-31351229-0:~$ sudo docker ps | grep whereami
cd299e883674        erjosito/whereami            "/bin/sh -c '/usr/sb…"   About an hour ago   Up About an hour                        k8s_whereami_whereami-564765b89-qfq2k_default_48a93ac3-1cfd-11e9-981c-7abfb60d882f_0
64cee2b89ac6        erjosito/whereami            "/bin/sh -c '/usr/sb…"   About an hour ago   Up About an hour                        k8s_whereami_whereami-564765b89-j7bpw_default_48a73a70-1cfd-11e9-981c-7abfb60d882f_0
jose@aks-nodepool1-31351229-0:~$ sudo docker inspect --format '{{ .State.Pid }}' cd299e883674
29634

Now you can enter the network namespace of that PID and run any command there. Let us verify that this namespace has the IP address of the pod:

jose@aks-nodepool1-31351229-0:~$ sudo nsenter -t 29634 -n ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
19: eth0@if20: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ee:af:f8:66:59:bd brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.13.76.26/24 scope global eth0
       valid_lft forever preferred_lft forever

Looks good! Notice the if20 in the name of the eth0 interface. This means that the interface is the peer of interface number 20 in the node (a veth pair). Let us find out which one interface number 20 is:

jose@aks-nodepool1-31351229-0:~$ ip a | grep 20:
20: azva4124965a9b@if19: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master azure0 state UP group default qlen 1000

There you go: if you wanted to capture traffic entering or leaving the pod, you could run tcpdump on interface azva4124965a9b (we will do that later in this post).

We can do one more verification to illustrate how networking with the Azure CNI works. For that, we need to install the brctl utility, which is included in the bridge-utils package:

jose@aks-nodepool1-31351229-0:~$ sudo apt install -y bridge-utils

...

jose@aks-nodepool1-31351229-0:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
azure0          8000.000d3a28bd02       no              azv022045195f6
                                                        azv1a68ecb6fb1
                                                        azv3cd1b2db354
                                                        azv4abdd7fa2da
                                                        azv686acbd5570
                                                        azv8d9152d33d2
                                                        azva4124965a9b
                                                        azvc03d8409c59
                                                        eth0
docker0         8000.0242406081ac       no

As you can see, we have two bridges configured in the system. One is the well-known docker0 bridge. The other one is more interesting: it is created and used by the Azure CNI plugin, and the azure0 interface is actually where the node’s IP address is configured.
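You can easily confirm this last point: the node’s IP address (10.13.76.4) shows up on azure0, while eth0, which is just a member port of the bridge, should not carry an IPv4 address of its own:

ip addr show azure0 | grep "inet "
ip addr show eth0 | grep "inet "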

Bridge ports are numbered, and we need the port number of our azva4124965a9b interface. With that port number, we can see the MAC addresses learnt on that port:

jose@aks-nodepool1-31351229-0:~$ sudo brctl showstp azure0 | grep azva4124965a9b
azva4124965a9b (9)

jose@aks-nodepool1-31351229-0:~$ sudo brctl showmacs azure0 | grep -E "\s+9\s+"
  9     8a:4b:a8:d0:5c:24       yes                0.00
  9     8a:4b:a8:d0:5c:24       yes                0.00
  9     ee:af:f8:66:59:bd       no                 2.56

As you can see, the pod’s MAC address (we saw it when connecting to its network namespace) is learnt on this bridge port, so we can be sure this is the right interface. There is another way to see the MAC address of our pod:

jose@aks-nodepool1-31351229-0:~$ ping 10.13.76.26
PING 10.13.76.26 (10.13.76.26) 56(84) bytes of data.
64 bytes from 10.13.76.26: icmp_seq=1 ttl=64 time=0.049 ms
64 bytes from 10.13.76.26: icmp_seq=2 ttl=64 time=0.072 ms
64 bytes from 10.13.76.26: icmp_seq=3 ttl=64 time=0.044 ms
^C
--- 10.13.76.26 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2037ms
rtt min/avg/max/mdev = 0.044/0.055/0.072/0.012 ms

jose@aks-nodepool1-31351229-0:~$ arp -a 10.13.76.26
? (10.13.76.26) at ee:af:f8:66:59:bd [ether] PERM on azure0

You can do captures either on eth0 (to see traffic entering/leaving the node) or on azva4124965a9b (to see traffic entering/leaving the pod). Let’s capture traffic on eth0 as an example:

jose@aks-nodepool1-31351229-0:~$ sudo tcpdump -i eth0 port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
23:38:56.006768 IP 10.13.76.4.50358 > 168.63.129.16.http: Flags [S], seq 2243581835, win 29200, options [mss 1460,sackOK,TS val 1927742031 ecr 0,nop,wscale 7], length 0
23:38:56.007231 IP 168.63.129.16.http > 10.13.76.4.50358: Flags [S.], seq 3358221709, ack 2243581836, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 560363995 ecr 1927742031], length 0
23:38:56.007265 IP 10.13.76.4.50358 > 168.63.129.16.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 1927742031 ecr 560363995], length 0
23:38:56.007307 IP 10.13.76.4.50358 > 168.63.129.16.http: Flags [P.], seq 1:199, ack 1, win 229, options [nop,nop,TS val 1927742031 ecr 560363995], length 198: HTTP: GET /machine/?comp=goalstate HTTP/1.1
23:38:56.008010 IP 168.63.129.16.http > 10.13.76.4.50358: Flags [FP.], seq 1:2392, ack 199, win 1026, options [nop,nop,TS val 560363996 ecr 1927742031], length 2391: HTTP: HTTP/1.1 200 OK
23:38:56.008102 IP 10.13.76.4.50358 > 168.63.129.16.http: Flags [.], ack 2393, win 266, options [nop,nop,TS val 1927742032 ecr 560363996], length 0
23:38:56.008521 IP 10.13.76.4.50358 > 168.63.129.16.http: Flags [F.], seq 199, ack 2393, win 266, options [nop,nop,TS val 1927742032 ecr 560363996], length 0

As you can see, there is a lot of chatter to and from 168.63.129.16, the Azure platform address that is also the source of the Azure Load Balancer health probes. Let’s filter it out in our tcpdump command:

 

jose@aks-nodepool1-31351229-0:~$ sudo tcpdump -n -i eth0 port 80 and not host 168.63.129.16
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
23:40:31.788475 IP 109.125.120.58.62538 > 51.144.176.251.80: Flags [S], seq 1755954558, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
23:40:31.788549 IP 51.144.176.251.80 > 109.125.120.58.62538: Flags [S.], seq 1844749018, ack 1755954559, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
23:40:31.820718 IP 109.125.120.58.62538 > 51.144.176.251.80: Flags [.], ack 1, win 515, length 0
...

As before, 109.125.120.58 is my source IP. Note how the packets arrive on the interface with the public VIP as destination (remember DSR aka floating IP).

externalTrafficPolicy=local

Let’s now try the externalTrafficPolicy: Local property of LoadBalancer-type services. This makes the client’s source IP address visible to the pod. I have destroyed my setup and deployed a new YAML file:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: whereami
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: whereami
    spec:
      containers:
      - name: whereami
        image: erjosito/whereami:1.3
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: whereami
spec:
  type: LoadBalancer
  externalTrafficPolicy: "Local"
  ports:
  - port: 80
  selector:
    app: whereami
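By the way, instead of destroying and redeploying, you could also have patched the existing service in place; a quick sketch:

k patch svc whereami -p '{"spec":{"externalTrafficPolicy":"Local"}}'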

Let’s have a look at the items deployed, very similar to our previous example:

$ k get svc
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP      10.0.0.1     <none>         443/TCP        10h
whereami     LoadBalancer   10.0.12.96   13.93.66.174   80:32663/TCP   3m

$ k get ep whereami
NAME       ENDPOINTS                       AGE
whereami   10.13.76.15:80,10.13.76.30:80   3m

Let us check the connection tracking table (after generating some real traffic from your browser):

jose@aks-nodepool1-31351229-0:~$ sudo conntrack -L -d 13.93.66.174
tcp      6 295 ESTABLISHED src=167.220.196.180 dst=13.93.66.174 sport=57675 dport=80 src=10.13.76.15 dst=167.220.196.180 sport=80 dport=57675 [ASSURED] mark=0 use=1
tcp      6 296 ESTABLISHED src=167.220.196.180 dst=13.93.66.174 sport=38332 dport=80 src=10.13.76.30 dst=167.220.196.180 sport=80 dport=38332 [ASSURED] mark=0 use=1
conntrack v1.4.3 (conntrack-tools): 2 flow entries have been shown.

In this case there are two connections, both from my new source IP address 167.220.196.180 (I changed laptops in the meantime). Note, however, that for the return traffic (the second pair of IPs) the destination is now the client’s IP rather than the node’s IP: the client address was not source-NATted on its way to the pod.

Let us double-check by capturing at the pod (we did not do this in the previous section). I will take the second pod this time:

jose@aks-nodepool1-31351229-0:~$ sudo docker ps | grep whereami
b0370d1f082e        c8e4ff7df026                 "/bin/sh -c '/usr/sb…"   11 minutes ago      Up 11 minutes                           k8s_whereami_whereami-564765b89-jpkgg_default_96d2aa7c-1d54-11e9-981c-7abfb60d882f_0
0352b1445446        c8e4ff7df026                 "/bin/sh -c '/usr/sb…"   11 minutes ago      Up 11 minutes                           k8s_whereami_whereami-564765b89-nqgz9_default_96d0d44e-1d54-11e9-981c-7abfb60d882f_0

jose@aks-nodepool1-31351229-0:~$ sudo docker inspect --format '{{ .State.Pid }}' 0352b1445446
11232

jose@aks-nodepool1-31351229-0:~$ sudo nsenter -t 11232 -n ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
21: eth0@if22: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether aa:08:fc:c4:9c:f8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.13.76.30/24 scope global eth0
       valid_lft forever preferred_lft forever

jose@aks-nodepool1-31351229-0:~$ ip a | grep 22:
22: azv47c3ee9d512@if21: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master azure0 state UP group default qlen 1000

jose@aks-nodepool1-31351229-0:~$ sudo tcpdump -n -i azv47c3ee9d512 port 80 and not host 168.63.129.16 and not host 169.254.169.254
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on azv47c3ee9d512, link-type EN10MB (Ethernet), capture size 262144 bytes
08:31:20.206214 IP 167.220.197.180.13829 > 10.13.76.30.80: Flags [S], seq 3186291510, win 64240, options [mss 1300,nop,wscale 8,nop,nop,sackOK], length 0
08:31:20.206278 IP 10.13.76.30.80 > 167.220.197.180.13829: Flags [S.], seq 218333007, ack 3186291511, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
08:31:20.235738 IP 167.220.197.180.13829 > 10.13.76.30.80: Flags [.], ack 1, win 512, length 0

As you can see, the original client IP shows up in the capture at the pod interface. Let us have a look at how the iptables configuration is different:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep whereami | grep KUBE-SERVICES
-A KUBE-SERVICES ! -s 10.13.76.0/24 -d 10.0.12.96/32 -p tcp -m comment --comment "default/whereami: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.12.96/32 -p tcp -m comment --comment "default/whereami: cluster IP" -m tcp --dport 80 -j KUBE-SVC-7G2JV7LNOR6DDNIY
-A KUBE-SERVICES -d 13.93.66.174/32 -p tcp -m comment --comment "default/whereami: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-7G2JV7LNOR6DDNIY

No change in the KUBE-SERVICES chain: as before, no marking is done there for the load balancer IP. Let us look at the target chain KUBE-FW-7G2JV7LNOR6DDNIY:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep whereami | grep KUBE-FW-7G2JV7LNOR6DDNIY
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-XLB-7G2JV7LNOR6DDNIY
-A KUBE-FW-7G2JV7LNOR6DDNIY -m comment --comment "default/whereami: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-SERVICES -d 13.93.66.174/32 -p tcp -m comment --comment "default/whereami: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-7G2JV7LNOR6DDNIY

As you can see in the FW chain, the KUBE-MARK-MASQ target has disappeared! This is why the packets are no longer source-NATted. We now have an XLB chain, which sits before the SEP chains (on a node without local endpoints for the service, this chain would drop the traffic instead, which is why the health probe we will look at next matters so much):

-A KUBE-XLB-7G2JV7LNOR6DDNIY -m comment --comment "Balancing rule 0 for default/whereami:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-SGJO6RZWDBFIJSTC
-A KUBE-XLB-7G2JV7LNOR6DDNIY -m comment --comment "Balancing rule 1 for default/whereami:" -j KUBE-SEP-YJDPB5HJYN7OEYYV

-A KUBE-SEP-SGJO6RZWDBFIJSTC -s 10.13.76.15/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-SGJO6RZWDBFIJSTC -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.13.76.15:80
-A KUBE-SEP-YJDPB5HJYN7OEYYV -s 10.13.76.30/32 -m comment --comment "default/whereami:" -j KUBE-MARK-MASQ
-A KUBE-SEP-YJDPB5HJYN7OEYYV -p tcp -m comment --comment "default/whereami:" -m tcp -j DNAT --to-destination 10.13.76.30:80


The rest of the rules do not need to change:

jose@aks-nodepool1-31351229-0:~$ sudo iptables-save | grep MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING ! -d 10.0.0.0/8 -m iprange ! --dst-range 168.63.129.16-168.63.129.16 -m addrtype ! --dst-type LOCAL -j MASQUERADE
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

jose@aks-nodepool1-31351229-0:~$ sudo iptables -t nat -L KUBE-MARK-MASQ
Chain KUBE-MARK-MASQ (18 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK or 0x4000


The mysterious probe

Now let’s have a look at the Azure load balancer. Something must be different there, since with externalTrafficPolicy: Local the Azure Load Balancer should only send packets to nodes that actually host pods for the service. The load balancing rule is the same:

$ az network lb rule list -g $noderg --lb-name $lb -o table
BackendPort    EnableFloatingIp    EnableTcpReset    FrontendPort    IdleTimeoutInMinutes    LoadDistribution    Name                                     Protocol    ProvisioningState    ResourceGroup
-------------  ------------------  ----------------  --------------  ----------------------  ------------------  ---------------------------------------  ----------  -------------------  ----------------------------------------
80             True                False             80              4                       Default             a96e1c7f91d5411e9981c7abfb60d882-TCP-80  Tcp         Succeeded            MC_akstest_aksPacketWalkAzure_westeurope

But the probe is now an HTTP probe! And it is monitoring a TCP port (32364) that is neither the service port nor the NodePort (32663)!

$ az network lb probe list -g $noderg --lb-name $lb -o table
IntervalInSeconds    Name                                     NumberOfProbes    Port    Protocol    ProvisioningState    RequestPath    ResourceGroup
-------------------  ---------------------------------------  ----------------  ------  ----------  -------------------  -------------  ----------------------------------------
5                    a96e1c7f91d5411e9981c7abfb60d882-TCP-80  2                 32364   Http        Succeeded            /healthz       MC_akstest_aksPacketWalkAzure_westeurope

We can verify that the backend address pool has not changed, and that the load balancing rule is targeting the nodes’ IP addresses:

$ az network lb address-pool list -g $noderg --lb-name $lb --query [].backendIpConfigurations[].id -o tsv
/subscriptions/e7da9914-9b05-4891-893c-546cb7b0422e/resourceGroups/MC_akstest_aksPacketWalkAzure_westeurope/providers/Microsoft.Network/networkInterfaces/aks-nodepool1-31351229-nic-0/ipConfigurations/ipconfig1

But where is port 32364 defined? And the /healthz endpoint? Let us try it out to see what we get:

jose@aks-nodepool1-31351229-0:~$ curl localhost:32364/healthz
{
        "service": {
                "namespace": "default",
                "name": "whereami"
        },
        "localEndpoints": 2
}

Interesting… However, there is no service in Kubernetes configured to answer on port 32364:

$ k get svc --all-namespaces
NAMESPACE     NAME                   TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)         AGE
default       kubernetes             ClusterIP      10.0.0.1      <none>         443/TCP         11h
default       whereami               LoadBalancer   10.0.12.96    13.93.66.174   80:32663/TCP    42m
kube-system   heapster               ClusterIP      10.0.187.18   <none>         80/TCP          11h
kube-system   kube-dns               ClusterIP      10.0.0.10     <none>         53/UDP,53/TCP   11h
kube-system   kubernetes-dashboard   ClusterIP      10.0.103.10   <none>         80/TCP          11h
kube-system   metrics-server         ClusterIP      10.0.95.138   <none>         443/TCP         11h

So it looks like we are going to have to find out the hard way. Let us discover which PID is listening on that port, and which Docker container is associated with that PID:

jose@aks-nodepool1-31351229-0:~$ sudo netstat -l -p | grep 32364
tcp6       0      0 [::]:32364              [::]:*                  LISTEN      4372/hyperkube

jose@aks-nodepool1-31351229-0:~$ sudo docker ps -q | xargs sudo docker inspect --format '{{.State.Pid}}, {{.ID}}, {{.Name}}' | grep "^4372,"
4372, 8d6c75b6842ed2be7cd2db6f397ce2a090492fb9603d7a687aa150b734f6ebc0, /k8s_kube-proxy_kube-proxy-l6th7_kube-system_007d4629-1cfc-11e9-981c-7abfb60d882f_0

Interesting, this starts to make sense: kube-proxy offers a /healthz endpoint that tells the load balancer whether there are any pods for a given service running on a particular node. This endpoint (and its port) is specific to each Kubernetes service. If we create a new service, for example with nginx, we will see a new probe configured in the load balancer, pointing to a different TCP port served by kube-proxy:

$ k apply -f ./nginx-elb.yaml
deployment.apps/nginx created
service/nginx created

$ az network lb probe list -g $noderg --lb-name $lb -o table
IntervalInSeconds    Name                                     NumberOfProbes    Port    Protocol    ProvisioningState    RequestPath    ResourceGroup
-------------------  ---------------------------------------  ----------------  ------  ----------  -------------------  -------------  ----------------------------------------
5                    a96e1c7f91d5411e9981c7abfb60d882-TCP-80  2                 32364   Http        Updating             /healthz       MC_akstest_aksPacketWalkAzure_westeurope
5                    a158189b11d5e11e9981c7abfb60d882-TCP-80  2                 31322   Http        Updating             /healthz       MC_akstest_aksPacketWalkAzure_westeurope

jose@aks-nodepool1-31351229-0:~$ curl localhost:32364/healthz
{
        "service": {
                "namespace": "default",
                "name": "whereami"
        },
        "localEndpoints": 2
}

jose@aks-nodepool1-31351229-0:~$ curl localhost:31322/healthz
{
        "service": {
                "namespace": "default",
                "name": "nginx"
        },
        "localEndpoints": 2
}
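By the way, you do not need netstat to find this port: Kubernetes stores it in the service spec as healthCheckNodePort (only populated when externalTrafficPolicy is Local):

k get svc whereami -o jsonpath='{.spec.healthCheckNodePort}'
k get svc nginx -o jsonpath='{.spec.healthCheckNodePort}'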

This concludes my investigations for today. I hope I could show you how the internal networking in AKS works, specifically with the Azure CNI plugin.

 
