After a good while without posting anything, I finally decided to slowly start writing again. This first post is about a little BGP trick that may help you increase the scale of Azure Route Server. The maximum of 8 BGP peers is typically enough for most designs, but if you happen to need to go beyond that limit, this post will offer you an escape route (pun intended). We will start with a quick recap of Azure Route Server (ARS from now on), then describe the problem at hand and how to solve it. Spoiler alert: the solution may remind you of BGP confederations, especially if you have ever used a BGP command similar to "next hop preserve". Here we go!
What was Azure Route Server again?
Every solution needs a problem, so let's recap the problems that ARS solves. There are actually two previously existing issues in Azure. The first one was the lack of a dynamic interface to add routes to NICs. As you might remember from my old post Azure Networking is not like your onprem network (if you happen to have read it), most Azure networking features are executed and enforced in Azure NICs (Network Interface Cards, but nobody ever spells that out). Let's look into the following construct:

You have two VMs: VM1 and VM2. Behind VM2 there is a network which is not part of the VNet. In my lab it is just a loopback interface with the IP address 100.0.0.100/32 defined inside the VM, but it could be a whole SDWAN range, point-to-site VPN users, or many other things, as long as the IP range does not belong to the VNet.
The important thing to notice here is that traffic from VM1 to VM2 always traverses VM1’s NIC, NIC1. If NIC1 doesn’t know about the destination range (100.0.0.100/32 in the example), it will drop the traffic.
Azure beginners often fail to realize this: they think that a route in VM1's operating system pointing to VM2 should be enough. They forget that Azure NICs are fully functional routers, and all routers in the chain need to know how to reach the final destination.
So how can you tell NIC1 about this extraneous network? You have two options: either you use static User-Defined Routes (UDRs), or, if you need something more dynamic (for example, for high availability), you can advertise that range to an Azure Route Server located in the same VNet. You see, Azure Route Server will always try to program the routes it learns via BGP on each and every NIC of its local (and directly peered) VNets. So the picture would now look like this:

ARS has learnt about the existence of 100.0.0.100/32 via BGP from VM2, and it will program a route to this network with VM2 as next hop in every NIC it can reach, in this example NIC1 and NIC2. NIC1 now knows how to reach 100.0.0.100/32.
NIC2 will receive the traffic from NIC1, but it will not do any routing lookup, and instead deliver all packets to its attached virtual machine, VM2 in this example.
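On the advertising side, this can be as simple as a few lines of BIRD 1.6 configuration (the BGP daemon I use in my labs). The following is a hypothetical sketch of what VM2's bird.conf could look like; the router ID, the local ASN 65001, and the ARS instance IP are made-up values for illustration, while ASN 65515 is the fixed ASN that Azure Route Server always uses:

```
# Hypothetical /etc/bird/bird.conf fragment on VM2 (BIRD 1.6)
router id 10.13.76.200;

protocol static {
    # The extraneous range (a loopback in this lab); it only needs
    # to exist in BIRD's table so it can be exported via BGP
    route 100.0.0.100/32 via "lo";
}

protocol bgp rs0 {
    local as 65001;                    # illustrative ASN for VM2
    neighbor 10.13.76.4 as 65515;      # one ARS instance; ARS always uses 65515
    multihop 2;                        # ARS is reachable but not on-link
    import none;
    export where source = RTS_STATIC;  # only advertise the static range
}
```

A second, identical protocol block pointing at the other ARS instance would give redundancy; the import/export policies are just one reasonable choice, not the only one.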
There is a second problem that ARS fixes, and it looks similar to the one above. If you have networks that the VNet doesn't know anything about, it is often not enough to let the NICs in your VNet know about them; you need to inform your on-premises environment too. How do you do that? If you are using BGP-enabled VPN or ExpressRoute gateways, ARS will also advertise to them the networks it knows about:

“Only” 8 BGP peerings!
Imagine now you have not only one VM that needs to propagate prefixes either into the VNet or to a BGP-enabled gateway, but more. You can add additional BGP peerings to Azure Route Server, until you reach its maximum: eight. For example, take the following scenario:

You have 10 Network Virtual Appliances (NVAs), each providing access to some prefixes. It is unlikely to reach this number, but it is possible: for example, because you have multiple types of NVAs for different purposes, such as North/South or East/West firewalls, SDWAN, point-to-site VPN, etc. Or because you have multiple instances of the same NVAs for scalability reasons, or because you need to separate appliances that serve different customers or partners. Or maybe because the solution you are implementing requires a high number of BGP peerings.
The latter is the situation a colleague confronted me with: with Nutanix clusters on Azure, a Nutanix component originally called "BGP VM" can create up to eight BGP peerings with Azure Route Server (you can read this excellent blog post from Jonas Werner for more details). If on top of that you have SDWAN NVAs in your architecture, you are in trouble.
What to do then? Finally, we are done with the intro: welcome to the main content of this post.
Next hop unchanged
The solution starts with a typical answer to many scaling problems: hierarchical aggregation. Translating from fancy words into English: you can have some intermediate "BGP hubs" that consolidate the BGP peerings from your NVAs before handing the routes to the route server, as this diagram shows:

However, there is a problem with this design: the BGP adjacencies between the "BGP hubs" and ARS are eBGP, because their Autonomous System Numbers (ASNs) are different (you can check https://www.bgp.us/ibgp-and-ebgp/ for more details on the differences between iBGP and eBGP). As a consequence, the default behavior of the BGP hubs is to set themselves as next hop in the routes they send over to ARS. This usually makes sense; however, in this case we want the routes in NIC1 to point to the final NVA (NVA0, NVA1, etc.), not to the BGP hubs.
Is it possible to override the default eBGP behavior and preserve the original next hop that each NVA advertised to the BGP hubs? The answer is yes, as is often the case with a protocol as flexible as BGP. A command exists in many mature BGP implementations that does exactly that: in Cisco it is called "next hop unchanged", in Juniper "no-nexthop-change" (or "no-nexthop-self"), and in BIRD (which I personally use) it is called "next hop keep". For other NVAs, I would suggest checking with your vendor.
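In BIRD 1.6, for example, the option goes into the ARS-facing BGP protocol on the hub. A minimal sketch, assuming the hub uses ASN 65000 and peers with the ARS instance at 10.13.76.4 (ARS always uses ASN 65515); the import/export policies here are illustrative:

```
# Hypothetical bird.conf fragment on a BGP hub (BIRD 1.6)
protocol bgp rs0 {
    local as 65000;
    neighbor 10.13.76.4 as 65515;  # ARS instance; ARS always speaks ASN 65515
    multihop 2;                    # ARS is not directly connected
    next hop keep;                 # preserve the NVAs' next hops instead of next-hop-self
    import none;                   # illustrative: don't take routes back from ARS
    export all;                    # re-advertise what was learned from the NVAs
}
```

With "next hop keep", the hub leaves the NEXT_HOP attribute untouched when re-advertising the NVAs' routes, which is exactly what we need ARS to see.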
Does it work?
Like a charm! Of course I needed to test this. I configured two BGP hubs for redundancy (you can certainly have more, since this is a critical service). If you are using Linux VMs as I do (Ubuntu 22.04 with the "bird" package installed), you could even use the VMSS-based setup with health probes and automatic self-healing that I described in the Azure Firewall's sidekick to join the BGP superheroes post. You can see that the BGP hubs are BGP-peered to both Azure Route Server instances and to each of the ten NVAs:
jose@bgppeer:~$ sudo birdc 'show prot'
BIRD 1.6.8 ready.
name     proto    table    state  since     info
device1  Device   master   up     15:15:00
direct1  Direct   master   down   15:15:00
kernel1  Kernel   master   down   15:15:00
static1  Static   master   up     15:15:00
nva0     BGP      master   up     15:15:05  Established
nva1     BGP      master   up     15:15:05  Established
nva2     BGP      master   up     15:17:31  Established
nva3     BGP      master   up     15:24:36  Established
nva4     BGP      master   up     15:27:39  Established
nva5     BGP      master   up     15:34:30  Established
nva6     BGP      master   up     15:42:49  Established
nva7     BGP      master   up     15:44:54  Established
nva8     BGP      master   up     15:47:32  Established
nva9     BGP      master   up     15:49:44  Established
rs0      BGP      master   up     15:15:05  Established
rs1      BGP      master   up     15:15:05  Established
If we have a look at the Azure Route Server, it is only peered to the two BGP hubs:
❯ az network routeserver peering list --routeserver $ars_name -g $rg -o table
Name    PeerAsn    PeerIp       ProvisioningState    ResourceGroup
------  ---------  -----------  -------------------  ---------------
peer1   65000      10.13.76.84  Succeeded            routeserver
peer2   65000      10.13.76.85  Succeeded            routeserver
Notice how the BGP hubs have the IP addresses 10.13.76.84 and 10.13.76.85. However, if we look at the learned routes, we can see that the next hop is the actual NVA’s IP address (10.13.76.100 through 10.13.76.109):
❯ az network routeserver peering list-learned-routes --routeserver $ars_name -g $rg --query 'RouteServiceRole_IN_0' -o table -n peer1
AsPath       LocalAddress    Network         NextHop       Origin    SourcePeer    Weight
-----------  --------------  --------------  ------------  --------  ------------  --------
65000-65100  10.13.76.4      100.0.0.100/32  10.13.76.100  EBgp      10.13.76.84   32768
65000-65101  10.13.76.4      100.0.0.101/32  10.13.76.101  EBgp      10.13.76.84   32768
65000-65102  10.13.76.4      100.0.0.102/32  10.13.76.102  EBgp      10.13.76.84   32768
65000-65103  10.13.76.4      100.0.0.103/32  10.13.76.103  EBgp      10.13.76.84   32768
65000-65104  10.13.76.4      100.0.0.104/32  10.13.76.104  EBgp      10.13.76.84   32768
65000-65105  10.13.76.4      100.0.0.105/32  10.13.76.105  EBgp      10.13.76.84   32768
65000-65106  10.13.76.4      100.0.0.106/32  10.13.76.106  EBgp      10.13.76.84   32768
65000-65107  10.13.76.4      100.0.0.107/32  10.13.76.107  EBgp      10.13.76.84   32768
65000-65108  10.13.76.4      100.0.0.108/32  10.13.76.108  EBgp      10.13.76.84   32768
65000-65109  10.13.76.4      100.0.0.109/32  10.13.76.109  EBgp      10.13.76.84   32768
And sure enough, inspecting the effective routes in any virtual machine in the virtual network will show the correct routes being installed:
❯ az network nic show-effective-route-table -n vmVMNic -g $rg -o table
Source                 State    Address Prefix    Next Hop Type          Next Hop IP
---------------------  -------  ----------------  ---------------------  -------------
Default                Active   10.13.76.0/24     VnetLocal
VirtualNetworkGateway  Active   100.0.0.101/32    VirtualNetworkGateway  10.13.76.101
VirtualNetworkGateway  Active   100.0.0.100/32    VirtualNetworkGateway  10.13.76.100
VirtualNetworkGateway  Active   100.0.0.102/32    VirtualNetworkGateway  10.13.76.102
VirtualNetworkGateway  Active   100.0.0.108/32    VirtualNetworkGateway  10.13.76.108
VirtualNetworkGateway  Active   100.0.0.103/32    VirtualNetworkGateway  10.13.76.103
VirtualNetworkGateway  Active   100.0.0.107/32    VirtualNetworkGateway  10.13.76.107
VirtualNetworkGateway  Active   100.0.0.104/32    VirtualNetworkGateway  10.13.76.104
VirtualNetworkGateway  Active   100.0.0.106/32    VirtualNetworkGateway  10.13.76.106
VirtualNetworkGateway  Active   100.0.0.105/32    VirtualNetworkGateway  10.13.76.105
VirtualNetworkGateway  Active   100.0.0.109/32    VirtualNetworkGateway  10.13.76.109
Default                Active   0.0.0.0/0         Internet
By the way, don't let the source VirtualNetworkGateway confuse you. The mechanism with which Route Server "injects" its routes in the NICs is the same one used by ExpressRoute and VPN gateways (and even Virtual WAN), so the same source type appears for these dynamically programmed routes.
The last thing is testing the data plane from the VM: give me one ping, Vasili! Although here I am testing with ten pings per NVA, not just one. The following output only shows the first four, all with no packet loss, but believe me, all of them work fine (Jedi hand move):
jose@vm:~$ for i in $(seq 0 9); do ping -q 100.0.0.10$i -c 10; done
PING 100.0.0.100 (100.0.0.100) 56(84) bytes of data.

--- 100.0.0.100 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 1.032/1.492/3.338/0.632 ms
PING 100.0.0.101 (100.0.0.101) 56(84) bytes of data.

--- 100.0.0.101 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 1.054/1.300/1.782/0.213 ms
PING 100.0.0.102 (100.0.0.102) 56(84) bytes of data.

--- 100.0.0.102 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9026ms
rtt min/avg/max/mdev = 0.822/1.273/1.620/0.258 ms
PING 100.0.0.103 (100.0.0.103) 56(84) bytes of data.

--- 100.0.0.103 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 1.043/1.607/2.417/0.478 ms
...
Reusing NVAs
OK, you are not too fond of maintaining Linux and BIRD yourself, I hear you say. But you happen to have some appliances in your VNet that support BGP and already peer with Azure Route Server. Could you repurpose those as “BGP hubs” or “BGP aggregators” for other NVAs?
Yes, but you would have to be careful here: these NVAs should set themselves as next hop for the prefixes they are responsible for, but preserve the original next hop for prefixes they get from other NVAs in the VNet. For example, consider the expansion of our previous example where now the BGP hubs are also NVAs that need to inject certain prefixes on their own:

In this case, the BGP hubs should preserve the next hop in the updates that come from the NVAs on the right (the blue prefixes 100.0.0.100-109, with ASNs 65100 through 65109), but they should set themselves as next hop for the prefixes they are responsible for (the green prefix 100.0.0.200 in the diagram above). This means that the configuration cannot be applied at the neighbor level; instead, it needs a route map. For example, a possible configuration for a Cisco device might look like this:
router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.13.76.4 remote-as 65515
 neighbor 10.13.76.4 ebgp-multihop 2
 neighbor 10.13.76.5 remote-as 65515
 neighbor 10.13.76.5 ebgp-multihop 2
 !
 address-family ipv4
  neighbor 10.13.76.4 activate
  neighbor 10.13.76.4 route-map ToARS out
  neighbor 10.13.76.5 activate
  neighbor 10.13.76.5 route-map ToARS out
 exit-address-family
!
route-map ToARS permit 10
 match as-path 1
 set ip next-hop unchanged
route-map ToARS permit 20
!
ip as-path access-list 1 permit ^6510[0-9]
The previous configuration does the following:
- Peer with each of the two ARS instances (10.13.76.4 and 10.13.76.5, ASN 65515).
- For each of the two ARS instances apply the route-map ToARS in the outgoing direction.
- For prefixes matching the AS-path list 1 (which matches everything coming from ASNs 65100-65109), preserve the original next-hop.
- For everything else, use the default behavior (which for eBGP is next-hop self).
Disclaimer: I haven't had the time to actually test the configuration above; the main goal is to show the concept: you do next-hop self for some routes and preserve the next hop for others. Of course, for other NVA vendors the required configuration will be different, or it might not even be possible (for example, I didn't find out how to do this selectively with my beloved BIRD 1.6).
This way you don’t need to add extra appliances to your design, and your existing NVAs can take over the role of “BGP hub” or “BGP aggregator”.
Conclusion
If your design happens to need more BGP neighbors than Azure Route Server supports, worry not: there is a workaround. You need to verify whether your NVAs support a "next hop keep" sort of functionality, and you should be good to go. However, before going this way, my suggestion would be to think hard about whether your design is overly complex, since the ARS limit of 8 peers is usually enough for most architectures.
Did I forget to mention anything? Do you disagree on any point? Please let me know in the comments below!

Hi Jose, I've always appreciated your insightful and inspiring blog posts; such a pleasure to see a new update after a while! Your latest piece truly made my day.
I had a quick technical question: if iBGP is established between the ARS and the BGP hub, and eBGP is configured between the BGP hub and the NVAs, wouldn't that naturally resolve the next-hop issue? From what I understand, iBGP peers retain the original next hop when advertising routes, so the BGP hub would preserve the original next hop when advertising to the ARS. Please feel free to correct me if I'm off base; apologies if this is a basic question.
Wishing you a great day, and looking forward to your next post as always!
Hey Jaeok, thanks for your kind words! In theory you are right, but I always thought that ARS only supports eBGP. I have never tried configuring an iBGP peer; not sure if it even works.