A day in the life of a packet in Azure Red Hat OpenShift (part 3)

This is part 3 of a blog series on networking in Azure Red Hat OpenShift, where we look at how pods talk to each other inside the cluster and to other systems in the Virtual Network or on-premises. Other posts in the series:

In the previous parts of this blog series we have seen how pods can talk to each other. Does that work across project boundaries as well? Let’s find out, although the SDN plugin that we saw Azure Red Hat OpenShift uses already gives us some clues.

Let’s start by creating another API server in a different project, project2, which will try to access the SQL Server in project1:

# Variables
project_name=project2
# New project
oc new-project $project_name
sql_password=yoursupersecurepassword
# New app
oc new-app --docker-image erjosito/sqlapi:0.1 -e "SQL_SERVER_FQDN=server.project1.svc.cluster.local" -e "SQL_SERVER_USERNAME=sa" -e "SQL_SERVER_PASSWORD=${sql_password}"
# Exposing ClusterIP Svc over a route
oc expose svc sqlapi

Note that the FQDN for the SQL Server still points to the one in project1, and that we have not deployed any SQL Server in project2. We can verify that our new API is up and running, and that it has connectivity to the SQL Server in project1:

curl "http://sqlapilb-project2.apps.m50kgrxk.northeurope.aroapp.io/api/healthcheck"
{
"health": "OK"
}
curl "http://sqlapilb-project2.apps.m50kgrxk.northeurope.aroapp.io/api/sqlversion"
{
"sql_output": "Microsoft SQL Server 2019 (RTM-CU4) (KB4548597) - 15.0.4033.1 (X64) \n\tMar 14 2020 16:10:35 \n\tCopyright (C) 2019 Microsoft Corporation\n\tDeveloper Edition (64-bit) on Linux (Ubuntu 18.04.4 LTS) "
}

There are two lessons that can be learnt here. The first one is that Azure Red Hat OpenShift uses the OpenShift SDN plugin in network policy mode. This means that pods in different tenants (aka namespaces) can by default communicate with each other without any restriction. You can find more information about the different modes in the documentation for OpenShift SDN. Actually, we already saw a hint of this in part 1 of this series:

oc get clusternetworks.network.openshift.io -o yaml
apiVersion: v1
items:
- apiVersion: network.openshift.io/v1
  clusterNetworks:
  - CIDR: 10.128.0.0/14
    hostSubnetLength: 9
  hostsubnetlength: 9
  kind: ClusterNetwork
  metadata:
    creationTimestamp: "2020-05-27T06:10:34Z"
    generation: 1
    name: default
    ownerReferences:
    - apiVersion: operator.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Network
      name: cluster
      uid: da3cf28f-2ec6-4ccd-9c51-ffc0f5897be2
    resourceVersion: "1774"
    selfLink: /apis/network.openshift.io/v1/clusternetworks/default
    uid: c74b6a66-99ff-492d-90fa-a615a84c337e
  mtu: 1450
  network: 10.128.0.0/14
  pluginName: redhat/openshift-ovs-networkpolicy
  serviceNetwork: 172.30.0.0/16
  vxlanPort: 4789
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
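The relevant field in the output above is “pluginName: redhat/openshift-ovs-networkpolicy”. If you just want to extract that single value, a one-liner like the following should do (a sketch, assuming the ClusterNetwork object is called “default” as in the listing above):

# Print only the SDN plugin name from the default ClusterNetwork object
oc get clusternetwork default -o jsonpath='{.pluginName}'
# redhat/openshift-ovs-networkpolicy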

The second important corollary of this interproject communication is that DNS service discovery works across projects: the API pod in project “project2” could successfully resolve the FQDN “server.project1.svc.cluster.local”.
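If you want to see that cross-project name resolution for yourself, you can run a lookup from inside the API pod in project2. This is just a sketch: it assumes the sqlapi container image ships with getent, so adapt the command to whatever resolver tool your image contains:

# Resolve the project1 service FQDN from the API pod in project2
oc project project2
oc rsh dc/sqlapi getent hosts server.project1.svc.cluster.local
# Should return the ClusterIP (172.30.x.x) of the "server" service in project1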

But what if we want to restrict communications, and prevent pods in one project from being accessed by pods in other projects? Good old Kubernetes Network Policy to the rescue.

As you can check in the OpenShift documentation for Network Policy, there are many ways of restricting communication between pods. In this example we are going to apply a network policy to the SQL Server so that it only accepts connections from its own namespace (project1):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-project1
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector:
    matchLabels:
      app: server
  policyTypes:
  - Ingress

This policy applies to all pods with the label app=server (like our SQL Server pod). By default, a “podSelector” in the “from” clause is scoped to the policy’s own namespace, so this rule allows traffic from any pod in the same namespace; since it is the only ingress rule, everything else (including traffic from other projects) is dropped.
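For completeness, here is one way you could create that policy in project1 (a sketch; it assumes you are still logged in to the cluster and that the SQL Server pod actually carries the label app=server):

# Create the network policy in project1, where the SQL Server pod lives
cat <<EOF | oc apply -n project1 -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-project1
spec:
  podSelector:
    matchLabels:
      app: server
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress
EOF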

If you try again to reach the SQL Server in project1 from project2, it will not work:

curl "http://sqlapilb-project2.apps.m50kgrxk.northeurope.aroapp.io/api/sqlversion"

An additional useful policy that you might want to include is one that makes frontend pods (such as the API pod in our example) only accessible from the ingress controller. You can find an example of such a policy in the OpenShift documentation for Network Policy.
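For reference, such a policy would look roughly like the sketch below, based on the example in the OpenShift docs. Note that if your default ingress controller runs on the host network the documented example differs slightly, so double check against the documentation for your cluster version:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
  policyTypes:
  - Ingress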

Connectivity to the Virtual Network

So far we have covered traffic flows inside of the cluster and from the Internet. What about the rest of the Virtual Network and on-premises networks? To test this we will deploy a Virtual Machine in the same Virtual Network, but in a different subnet. There are many ways of deploying a virtual machine in Azure; my favorite is the CLI. Note that the script below assumes that variables such as $rg, $location, $vnet_name, $vm_subnet_name and $vm_subnet_prefix are already defined:

vm_name=apivm
vm_nsg_name=${vm_name}-nsg
vm_pip_name=${vm_name}-pip
vm_disk_name=${vm_name}-disk0
vm_sku=Standard_B2ms
publisher=Canonical
offer=UbuntuServer
sku=18.04-LTS
image_urn=$(az vm image list -p $publisher -f $offer -s $sku -l $location --query '[0].urn' -o tsv)
az network vnet subnet create -n $vm_subnet_name --vnet-name $vnet_name -g $rg --address-prefixes $vm_subnet_prefix
az vm create -n $vm_name -g $rg -l $location --image $image_urn --size $vm_sku --generate-ssh-keys \
  --os-disk-name $vm_disk_name --os-disk-size-gb 32 \
  --vnet-name $vnet_name --subnet $vm_subnet_name \
  --nsg $vm_nsg_name --nsg-rule SSH --public-ip-address $vm_pip_name
vm_pip_ip=$(az network public-ip show -n $vm_pip_name -g $rg --query ipAddress -o tsv)
ssh-keyscan -H $vm_pip_ip >> ~/.ssh/known_hosts

The previous bash commands get the latest Ubuntu 18.04 image and deploy it in a new subnet of the virtual network. The script then retrieves the allocated public IP address and adds it to the known_hosts file, so that we can send commands to the VM over SSH without being prompted to confirm the host key. To verify that the VM has been deployed successfully, we can check its private IP address by sending a remote command over SSH:

ssh $vm_pip_ip "ip a"                                 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:d8:21:fa brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.101/28 brd 192.168.0.111 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fed8:21fa/64 scope link
       valid_lft forever preferred_lft forever

Let’s start by checking how the API sees us. In previous posts we came in through the public router over the Internet, but if you remember, we had deployed an internal load balancer for our API as well:

oc get svc
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)          AGE
server         ClusterIP      172.30.193.92    <none>          1433/TCP         46m
sqlapi         ClusterIP      172.30.3.55      <none>          8080/TCP         46m
sqlapilb       LoadBalancer   172.30.165.130   192.168.0.11    8080:31192/TCP   46m

By the way, we might want to have a look at how this IP address is implemented. If you remember, there are some load balancers provisioned in the node resource group:

node_rg_id=$(az aro show -n $cluster_name -g $rg --query 'clusterProfile.resourceGroupId' -o tsv)
node_rg_name=$(echo $node_rg_id | cut -d/ -f 5)
az network lb list -g $node_rg_name -o table
Location     Name                    ProvisioningState    ResourceGroup    ResourceGuid
-----------  ----------------------  -------------------  ---------------  ------------------------------------
northeurope  aro2-p8bjm              Succeeded            aro2-resources   b1630a28-0e71-49ee-9e63-9c0d5edeaebc
northeurope  aro2-p8bjm-internal     Succeeded            aro2-resources   7521a4e5-19d4-428e-a2e3-d59370457abf
northeurope  aro2-p8bjm-internal-lb  Succeeded            aro2-resources   b01e0eb0-6035-4a61-9dd1-54642410c7ae
northeurope  aro2-p8bjm-public-lb    Succeeded            aro2-resources   5ec7a5b5-a6b9-4892-ba02-dc89acbe28ee

We are interested in the “-internal” one, which is the internal load balancer where the worker nodes are connected. To double check, let us verify the frontend IP addresses; we should see the private IP address of our service:

az network lb frontend-ip list --lb-name aro2-p8bjm-internal -g $node_rg_name -o table
Name                                   PrivateIpAddress    PrivateIpAddressVersion    PrivateIpAllocationMethod    ProvisioningState    ResourceGroup
-------------------------------------  ------------------  -------------------------  ---------------------------  -------------------  ---------------
a37227ba481534bb6aba9b048186900e       192.168.0.11        IPv4                       Dynamic                      Succeeded            aro2-resources
a22a55112e91348d48c6fcf87f4f1cca-apps  192.168.0.132       IPv4                       Dynamic                      Succeeded            aro2-resources

And there it is! One more thing: let us check the health probe that the Azure Load Balancer is using:

az network lb probe list --lb-name aro2-p8bjm-internal -g $node_rg_name -o table
IntervalInSeconds    Name                                       NumberOfProbes    Port    Protocol    ProvisioningState    ResourceGroup
-------------------  -----------------------------------------  ----------------  ------  ----------  -------------------  ---------------
5                    af73928d6b8954aac8024b76d833f652-TCP-8080  2                 31192   Tcp         Succeeded            aro2-resources
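This probed port is simply the NodePort that Kubernetes allocated for the sqlapilb service (you can see it as 8080:31192/TCP in the oc get svc output above). If you want to read it directly from the service, a quick sketch (run it in the project that owns sqlapilb):

# Print the NodePort allocated for the first port of the sqlapilb service
oc get svc sqlapilb -o jsonpath='{.spec.ports[0].nodePort}'
# 31192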

The important thing is that the probe is using the NodePort TCP port 31192; we will come back to this later. Now we can connect to the internal load balancer from the VM:

ssh $vm_pip_ip "curl -s http://192.168.0.11:8080/api/ip"        
{
  "my_default_gateway": "10.131.0.1",
  "my_dns_servers": "['172.30.0.10']",
  "my_private_ip": "10.131.0.40",
  "my_public_ip": "40.127.221.40",
  "path_accessed": "192.168.0.11:8080/api/ip",
  "sql_server_fqdn": "server.project1.svc.cluster.local",
  "sql_server_ip": "172.30.72.7",
  "x-forwarded-for": null,
  "your_address": "10.131.0.1",
  "your_browser": "None",
  "your_platform": "None"
}

This is interesting: the pod sees us coming from 10.131.0.1, not from the original IP address of the Virtual Machine. But what is 10.131.0.1? If you remember part 1, 10.131.0.0/23 is the IP address range that the OpenShift SDN has allocated to the worker node where our pod runs. Each node has an internal virtual router based on Open vSwitch that acts as default gateway for the pods, and that router performs Source NAT on inbound traffic coming from outside of the cluster. The reason why packets must be SNATted is that the load balancer does not actually know on which node the relevant pod is running, so it will pick any healthy node, and from there the packet is forwarded to the right pod (possibly on a different node). OpenShift SDN uses SNAT to guarantee that the return packet follows the same path back.
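If you want to map that 10.131.0.1 gateway to an actual worker node, the OpenShift SDN publishes the per-node pod subnets as HostSubnet objects. A quick look (a sketch, the output will obviously be specific to your cluster) tells you which node owns 10.131.0.0/23:

# List the pod CIDR that the OpenShift SDN assigned to each node
oc get hostsubnets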

Something interesting to note is that the X-Forwarded-For header is empty, since there is no reverse proxy in the path, so the client IP information is not visible to the application at all. In some cases this is a serious problem; what can be done to fix it?

We will explore one solution in this post (modifying the internal load balancer) and leave another one for a future post (adding an internal router). If my previous explanation was halfway understandable, the root cause of the problem is that the Azure Load Balancer’s health probe checks the NodePort TCP port, which is open on every node, and hence the traffic can first hit a node that does not host the pod. Can we reconfigure the load balancer so that it only sends traffic to nodes actually hosting a relevant pod? Yes! We will change the service’s externalTrafficPolicy to “Local”:

oc edit svc/sqlapilb
spec:
  ...
  externalTrafficPolicy: Local
  ...
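If you prefer not to open an interactive editor, the same change can be applied non-interactively. A sketch using oc patch (run it in the project that owns the sqlapilb service):

# Switch the service to externalTrafficPolicy=Local without opening an editor
oc patch svc sqlapilb -p '{"spec":{"externalTrafficPolicy":"Local"}}'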

After doing that, let us verify the configuration of the Azure Load Balancer probes:

az network lb probe list --lb-name aro2-p8bjm-internal -g $node_rg_name -o table
IntervalInSeconds    Name                                       NumberOfProbes    Port    Protocol    ProvisioningState    RequestPath    ResourceGroup
-------------------  -----------------------------------------  ----------------  ------  ----------  -------------------  -------------  ---------------
5                    af73928d6b8954aac8024b76d833f652-TCP-8080  2                 32352   Http        Succeeded            /healthz       aro2-resources

There is an important difference: the probe is now not TCP but HTTP, and it targets a specific port and path (/healthz) on the OpenShift node that reports whether that node hosts any endpoint for our service or not. As a consequence, the Azure Load Balancer will only send traffic to nodes containing relevant pods, and Source NAT is not required any more. Let’s check from our VM again:

ssh $vm_pip_ip "curl -s http://192.168.0.11:8080/api/ip"
{
  "my_default_gateway": "",
  "my_dns_servers": "['172.30.0.10']",
  "my_private_ip": "10.129.2.12",
  "my_public_ip": "51.104.149.59",
  "path_accessed": "192.168.0.11:8080/api/ip",
  "sql_server_fqdn": "server.project1.svc.cluster.local",
  "sql_server_ip": "172.30.105.222",
  "x-forwarded-for": null,
  "your_address": "192.168.0.101",
  "your_browser": "None",
  "your_platform": "None"
}

And bingo! No NAT involved any more; now the application sees the original client’s IP address.
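In case you are wondering where the probed port 32352 came from: when a service is switched to externalTrafficPolicy=Local, Kubernetes allocates a dedicated health check NodePort, and kube-proxy answers on it with an HTTP /healthz endpoint that only reports healthy on nodes hosting a local endpoint of the service. You can read that port straight from the service; a sketch (run in the project that owns sqlapilb):

# Print the health check NodePort allocated for the service
oc get svc sqlapilb -o jsonpath='{.spec.healthCheckNodePort}'
# 32352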

Now that we are deep into the internal Azure Load Balancer, let’s try one more thing. You might have realized that the internal Load Balancer IP addresses come out of the worker nodes’ subnet. What if we exhaust the IP addresses there? You don’t want to be in a situation where you cannot scale or upgrade your cluster because of a lack of IP addresses. Additionally, in certain situations you might want to whitelist the IP range of your LoadBalancer services while excluding the nodes’ own IP addresses. ARO has a feature for you: deploying internal LoadBalancer services in a dedicated subnet. This is controlled via an additional annotation; let’s create a new subnet and a LoadBalancer service in that subnet:

ilb_subnet_name=apps
ilb_subnet_prefix=192.168.0.128/28
az network vnet subnet create -n $ilb_subnet_name --vnet-name $vnet_name -g $rg --address-prefixes $ilb_subnet_prefix
oc expose dc sqlapi --port 8080 --type=LoadBalancer --name=sqlapisubnet --dry-run -o yaml | awk '1;/metadata:/{ print "  annotations:\n    service.beta.kubernetes.io/azure-load-balancer-internal: \"true\"\n    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: \"'${ilb_subnet_name}'\"" }' | oc create -f -

As you can see, the previous command introduces two annotations: “service.beta.kubernetes.io/azure-load-balancer-internal” to signal that the Load Balancer will be internal, and “service.beta.kubernetes.io/azure-load-balancer-internal-subnet” to specify the subnet where the internal Azure Load Balancer frontend will be deployed. If you prefer declarative manifests over the awk pipeline, a rough equivalent of the resulting Service is sketched below.
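This is only a sketch of what the equivalent Service manifest might look like; the selector label (app: sqlapi) is an assumption based on what oc new-app typically generates, so check the labels on your pods before using it:

apiVersion: v1
kind: Service
metadata:
  name: sqlapisubnet
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "apps"
spec:
  type: LoadBalancer
  selector:
    app: sqlapi
  ports:
  - port: 8080
    targetPort: 8080

Let’s check it out!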

oc get svc
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)          AGE
server         ClusterIP      172.30.105.222   <none>          1433/TCP         34m
sqlapi         ClusterIP      172.30.215.232   <none>          8080/TCP         34m
sqlapilb       LoadBalancer   172.30.62.57     192.168.0.11    8080:31011/TCP   24m
sqlapisubnet   LoadBalancer   172.30.171.194   192.168.0.132   8080:31072/TCP   2m9s

And that is it: the new service called “sqlapisubnet” has been deployed with the IP address 192.168.0.132, which is the first allocatable IP address in the “apps” subnet 192.168.0.128/28 (Azure reserves the first four addresses of every subnet).

This concludes this post. In the next part we will have a look at Azure Private Link and DNS. Thanks for reading!
