Today I was looking at IPvlan on a Docker container in Azure alongside a colleague, and we found that plenty of the documentation and blog posts out there can be confusing when you run this setup on Azure.
What is this IPvlan thing, I hear you ask? Docker has a good explanation here, but let me offer you my short version: the host will just forward received packets with a certain destination IP to a given container. Period.

This is not switching or routing, but something much simpler. As a consequence, IPvlan (and its older sister MACvlan) is very popular for applications that require very low latency, such as Network Function Virtualization. In this post we will have a look at how to make this work in Azure. Spoiler alert: the building blocks of the solution are either secondary IP configurations in the NIC or a combination of UDRs and SNAT.
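If you want to see the mechanism without Docker in the picture, the same building block is available directly through iproute2. Here is a minimal sketch; the interface and namespace names and the 10.0.0.10 address are just illustrative, and Azure still needs to learn about that IP, as we will see below:

# Create an IPvlan sub-interface of eth0 in L3 mode and hand it to a network namespace
sudo ip netns add demo
sudo ip link add link eth0 name ipvl0 type ipvlan mode l3
sudo ip link set ipvl0 netns demo
# Give it an address and a default route inside the namespace
sudo ip netns exec demo ip addr add 10.0.0.10/24 dev ipvl0
sudo ip netns exec demo ip link set dev ipvl0 up
sudo ip netns exec demo ip route add default dev ipvl0

Docker's ipvlan driver does essentially this plumbing for you, which is what the rest of the post relies on.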
Containers in the host subnet
Let’s think of a first topology where the containers will be in the same subnet as the host:

To provision this, you will need a host with a container runtime. I am running Ubuntu 18.04 LTS (yes, old habits die hard) and Docker 23.0.6. Creating the IPvlan network is pretty straightforward:
jose@docker01:~$ sudo docker network create --driver ipvlan --subnet 10.0.0.0/24 --opt parent=eth0 --opt ipvlan_mode=l3 ipvlan_hostnet
jose@docker01:~$ sudo docker inspect ipvlan_hostnet
[
    {
        "Name": "ipvlan_hostnet",
        "Id": "d2d606aa32a9bf3073908727005eb883059dc59d4157150aae1016311236fd32",
        "Created": "2023-05-09T15:23:36.943208769Z",
        "Scope": "local",
        "Driver": "ipvlan",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "10.0.0.0/24"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {
            "ipvlan_mode": "l3",
            "parent": "eth0"
        },
        "Labels": {}
    }
]
Note how the IPvlan network is created in L3 mode, since L2 mode wouldn’t work in Azure (Azure SDN doesn’t replicate the L2 semantics of traditional Ethernet networks). Now that we have the IPvlan network, creating a container is not hard. You only need to specify the IP address matching the right subnet. I am using the Alpine image because it has most of the networking tools that you need to troubleshoot:
jose@docker01:~$ sudo docker run -d --net=ipvlan_hostnet --ip=10.0.0.10 --name container3 alpine sh -c 'while sleep 3600; do :; done'
If we now tried to ping something outside of the network, it wouldn't work. The reason is that Azure doesn't know that the IP address 10.0.0.10 is located in the VM. One way of informing Azure about this is creating a secondary IP configuration in the VM's NIC with the IP address of the container:
❯ az network nic ip-config create --nic-name docker01VMNic -n container3 -g $rg --vnet-name ipvlan --subnet docker --private-ip-address 10.0.0.10 --make-primary false
Let’s verify that the VM got its new private IP address:
❯ az vm list-ip-addresses -g $rg -o table
VirtualMachine    PublicIPAddresses    PrivateIPAddresses
----------------  -------------------  --------------------
docker01          20.223.163.208       10.0.0.4,10.0.0.10
testvm            20.238.58.100        10.0.1.4
Our container should now have full connectivity to the public Internet, as well as to the private VNet and on-premises connections. You can test this by entering the container with the docker exec command and pinging 8.8.8.8 for public Internet connectivity and 10.0.1.4 (another VM in the same VNet) for internal connectivity:
jose@docker01:~$ sudo docker exec -it container3 /bin/sh
/ # ping -c 5 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=103 time=1.806 ms
64 bytes from 8.8.8.8: seq=1 ttl=103 time=1.789 ms
64 bytes from 8.8.8.8: seq=2 ttl=103 time=10.655 ms
64 bytes from 8.8.8.8: seq=3 ttl=103 time=2.133 ms
64 bytes from 8.8.8.8: seq=4 ttl=103 time=2.294 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 1.789/3.735/10.655 ms
/ # ping -c 5 10.0.1.4
PING 10.0.1.4 (10.0.1.4): 56 data bytes
64 bytes from 10.0.1.4: seq=0 ttl=64 time=1.098 ms
64 bytes from 10.0.1.4: seq=1 ttl=64 time=1.548 ms
64 bytes from 10.0.1.4: seq=2 ttl=64 time=1.390 ms
64 bytes from 10.0.1.4: seq=3 ttl=64 time=1.613 ms
64 bytes from 10.0.1.4: seq=4 ttl=64 time=1.031 ms

--- 10.0.1.4 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 1.031/1.336/1.613 ms
Note that in the destination VM you will see the actual address of the container: the packets traverse the Azure SDN without being translated in any way:
jose@testvm:~$ sudo tcpdump -n host 10.0.0.10
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
13:24:39.895619 IP 10.0.0.10 > 10.0.1.4: ICMP echo request, id 64, seq 0, length 64
13:24:39.895649 IP 10.0.1.4 > 10.0.0.10: ICMP echo reply, id 64, seq 0, length 64
This setup is very similar to what the Azure CNI plugin can do (see Deploy container networking for a standalone Docker host for instructions on how to install it on a Docker machine), but with the simplicity and latency improvements of IPvlan.
Containers in different subnets from the host
Let’s do something different now. What about containers in a completely separate IP address space? I am thinking of something like this:

Let’s delete our container and our IPvlan network, and create a new network along with two new containers. Note how our new IPvlan network contains two subnets, 192.168.1.0/24 and 192.168.2.0/24, completely different from the subnet 10.0.0.0/24 where the host is located:
sudo docker stop container3
sudo docker rm container3
sudo docker network rm ipvlan_hostnet
sudo docker network create --driver ipvlan --subnet 192.168.1.0/24 --subnet 192.168.2.0/24 --opt parent=eth0 --opt ipvlan_mode=l3 ipvlan_net
sudo docker run -d --net=ipvlan_net --ip=192.168.1.10 --name container1 alpine sh -c 'while sleep 3600; do :; done'
sudo docker run -d --net=ipvlan_net --ip=192.168.2.10 --name container2 alpine sh -c 'while sleep 3600; do :; done'
You could try to do the same trick as in the last section with secondary IP configurations, but unfortunately it will not work:
❯ az network nic ip-config create --nic-name docker01VMNic -n container3 -g $rg --vnet-name ipvlan --subnet container1 --private-ip-address 192.168.1.10 --make-primary false
(IpConfigurationsOnSameNicCannotUseDifferentSubnets) IPConfigurations on a Nic /subscriptions/blahblah/docker01VMNic cannot belong to different subnets.
What can we do now? Well, there is a way of teaching Azure where to reach certain IP addresses: static routes, also known as User Defined Routes (UDRs). We will create static routes on the Test VM's subnet for the container prefixes and point them to the Docker host's IP address:

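Something along these lines should do it with the Azure CLI. Note that the route table name and the Test VM's subnet name below are my own placeholders, and enabling IP forwarding on the Docker host's NIC is an extra step I am assuming is required so that Azure delivers packets whose destination IP is not one of the NIC's own addresses:

# Route table with one UDR per container subnet, next hop the Docker host (10.0.0.4)
az network route-table create -g $rg -n containers
az network route-table route create -g $rg --route-table-name containers -n container-subnet-1 \
    --address-prefix 192.168.1.0/24 --next-hop-type VirtualAppliance --next-hop-ip-address 10.0.0.4
az network route-table route create -g $rg --route-table-name containers -n container-subnet-2 \
    --address-prefix 192.168.2.0/24 --next-hop-type VirtualAppliance --next-hop-ip-address 10.0.0.4
# Associate the route table with the Test VM's subnet (the subnet name "testvm" is assumed)
az network vnet subnet update -g $rg --vnet-name ipvlan -n testvm --route-table containers
# Let the Docker host's NIC send and receive traffic for IPs it doesn't own
az network nic update -g $rg -n docker01VMNic --ip-forwarding true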
We can now ping from one container to the other, as well as to the Test VM:
/ # ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
18: eth0@if2: mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:0d:3a:d8:24:1d brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
/ # ping -c 5 192.168.2.10
PING 192.168.2.10 (192.168.2.10): 56 data bytes
64 bytes from 192.168.2.10: seq=0 ttl=64 time=0.077 ms
64 bytes from 192.168.2.10: seq=1 ttl=64 time=0.098 ms
64 bytes from 192.168.2.10: seq=2 ttl=64 time=0.091 ms
64 bytes from 192.168.2.10: seq=3 ttl=64 time=0.088 ms
64 bytes from 192.168.2.10: seq=4 ttl=64 time=0.094 ms

--- 192.168.2.10 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.077/0.089/0.098 ms
/ # ping -c 5 10.0.1.4
PING 10.0.1.4 (10.0.1.4): 56 data bytes
64 bytes from 10.0.1.4: seq=0 ttl=64 time=0.988 ms
64 bytes from 10.0.1.4: seq=1 ttl=64 time=1.098 ms
64 bytes from 10.0.1.4: seq=2 ttl=64 time=1.051 ms
64 bytes from 10.0.1.4: seq=3 ttl=64 time=1.186 ms
64 bytes from 10.0.1.4: seq=4 ttl=64 time=1.109 ms

--- 10.0.1.4 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.988/1.086/1.186 ms
What about the Internet? If the packet leaves the Docker host with 192.168.1.10 as its source, how does the Azure SDN know where to route the return packets? There is no Internet Gateway in Azure (as there is in AWS), so you cannot assign a route table “to the Internet”. Your first instinct might be to SNAT the egress packets on the Docker host itself, but that would break IPvlan: return traffic would carry the translated IP address, and IPvlan wouldn’t be able to map it to the right container.
However, SNAT is still the right approach, just not on the Docker host: you need a route table on the Docker host's subnet sending Internet traffic to the Test VM, which now acts more like a Network Virtual Appliance (NVA). The Test VM then SNATs the traffic before sending it on to the Internet:

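Here is one possible shape of that configuration. The route table name is my own placeholder, the Test VM's NIC name is assumed to follow the same pattern as the Docker host's, and the 0.0.0.0/0 prefix plus the iptables MASQUERADE rule are one way (not necessarily the only way) to implement the UDR-plus-SNAT idea; I am also assuming IP forwarding has to be enabled on the Test VM's NIC so it can forward traffic not addressed to it:

# UDR on the Docker host's subnet: send Internet-bound traffic to the Test VM (10.0.1.4)
az network route-table create -g $rg -n docker-to-nva
az network route-table route create -g $rg --route-table-name docker-to-nva -n default \
    --address-prefix 0.0.0.0/0 --next-hop-type VirtualAppliance --next-hop-ip-address 10.0.1.4
az network vnet subnet update -g $rg --vnet-name ipvlan -n docker --route-table docker-to-nva
# Allow the Test VM's NIC to forward traffic (NIC name assumed)
az network nic update -g $rg -n testvmVMNic --ip-forwarding true

# On the Test VM itself: enable kernel forwarding and SNAT everything leaving eth0
jose@testvm:~$ sudo sysctl -w net.ipv4.ip_forward=1
jose@testvm:~$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

With that in place, the containers reach the Internet through the Test VM: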
/ # ping -c 5 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=103 time=1.825 ms
64 bytes from 8.8.8.8: seq=1 ttl=103 time=2.019 ms
64 bytes from 8.8.8.8: seq=2 ttl=103 time=1.963 ms
64 bytes from 8.8.8.8: seq=3 ttl=103 time=2.374 ms
64 bytes from 8.8.8.8: seq=4 ttl=103 time=5.536 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 1.825/2.743/5.536 ms
Conclusion
If your goal is a very lean networking stack in a Docker host (or Podman, or any other container runtime), IPvlan is a good candidate. Depending on your design (whether the pods/containers share the same subnet as the host or not), you can solve the connectivity challenge either with secondary IP configurations in the Azure NIC or with a combination of UDRs and SNAT in a Network Virtual Appliance.
Hope this helps!