After a good while without posting anything, I finally decided to slowly start writing again. This first post is about a little BGP trick that may help you increase the scale of Azure Route Server. The maximum of 8 BGP peers is typically enough for most designs, but if you happen to need to go beyond that limit, this post will offer you an escape route (pun intended). We will start with a quick recap of Azure Route Server (ARS from now on), then describe the problem at hand and how to solve it. Spoiler alert: the solution may remind you of BGP confederations, especially if you have ever used a BGP command similar to "next hop preserve". Here we go!
What was Azure Route Server again?
Every solution needs a problem, so let's recap the problems that ARS solves. There are actually two previously existing issues in Azure. The first one was the lack of a dynamic interface to add routes to NICs. As you might remember from my old post Azure Networking is not like your onprem network (if you happen to have read it), most Azure networking features are executed and enforced in Azure NICs (Network Interface Cards, but nobody ever spells that out). Let's look into the following construct:

You have two VMs: VM1 and VM2. Behind VM2 there is a network which is not part of the VNet. In my lab it is just a loopback interface with the IP address 100.0.0.100/32 defined inside the VM, but it could be a whole SDWAN range, point-to-site VPN users, or many other things, as long as the IP range does not belong to the VNet.
The important thing to notice here is that traffic from VM1 to VM2 always traverses VM1’s NIC, NIC1. If NIC1 doesn’t know about the destination range (100.0.0.100/32 in the example), it will drop the traffic.
Azure beginners often fail to realize this: they think that a route in VM1's operating system pointing to VM2 should be enough. They forget that Azure NICs are fully functional routers, and all routers in the chain need to know how to reach the final destination.
So how can you tell NIC1 about this extraneous network? You have two options: either you use static User-Defined Routes (UDRs), or, if you need something more dynamic (for example, for high availability), you can advertise that range to an Azure Route Server located in the same VNet. You see, Azure Route Server will always try to program the routes it learns via BGP on each and every NIC of its local (and directly peered) VNets. So the picture would now look like this:

ARS has learnt about the existence of 100.0.0.100/32 via BGP from VM2, and it will program a route to this network with VM2 as next hop in every NIC it can reach, in this example NIC1 and NIC2. NIC1 now knows how to reach 100.0.0.100/32.
NIC2 will receive the traffic from NIC1, but it will not do any routing lookup, and instead deliver all packets to its attached virtual machine, VM2 in this example.
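On the advertising side, this can be as simple as a few lines of BIRD 1.6 configuration (the BGP daemon I use in my labs). The following is a hypothetical sketch of what VM2's bird.conf could look like; the router ID, the local ASN 65001, and the ARS instance IP are made-up values for illustration, while ASN 65515 is the fixed ASN that Azure Route Server always uses:

```
# Hypothetical /etc/bird/bird.conf fragment on VM2 (BIRD 1.6)
router id 10.13.76.200;

protocol static {
    # The extraneous range (a loopback in this lab); it only needs
    # to exist in BIRD's table so it can be exported via BGP
    route 100.0.0.100/32 via "lo";
}

protocol bgp rs0 {
    local as 65001;                    # illustrative ASN for VM2
    neighbor 10.13.76.4 as 65515;      # one ARS instance; ARS always uses 65515
    multihop 2;                        # ARS is reachable but not on-link
    import none;
    export where source = RTS_STATIC;  # only advertise the static range
}
```

A second, identical protocol block pointing at the other ARS instance would give redundancy; the import/export policies are just one reasonable choice, not the only one.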
There is a second problem that ARS fixes, and it looks similar to the one above. If you have networks that the VNet doesn't know anything about, it is often not enough to let the NICs in your VNet know about them; you need to inform your on-premises environment too. How do you do that? If you are using BGP-enabled VPN or ExpressRoute gateways, ARS will also advertise to them the networks it knows about:

“Only” 8 BGP peerings!
Imagine now you have not only one VM that needs to propagate prefixes either into the VNet or to a BGP-enabled gateway, but more. You can add additional BGP peerings to Azure Route Server, until you reach its maximum: eight. For example, take the following scenario:

You have 10 Network Virtual Appliances (NVAs), each providing access to some prefixes. It is unlikely to reach this number, but it is possible: for example, because you have multiple types of NVAs for different purposes, such as North/South or East/West firewalls, SDWAN, point-to-site VPN, etc. Or because you have multiple instances of the same NVAs for scalability reasons, or because you need to separate appliances that serve different customers or partners. Or maybe because the solution you are implementing requires a high number of BGP peerings.
The latter is the situation a colleague confronted me with: with Nutanix clusters on Azure, a Nutanix component originally called "BGP VM" can create up to eight BGP peerings with Azure Route Server (you can read this excellent blog post from Jonas Werner for more details). If on top of that you have SDWAN NVAs in your architecture, you are in trouble.
What to do then? Finally, we are done with the intro: welcome to the main content of this post.
Next hop unchanged
The solution starts with a typical answer to many scaling problems: hierarchical aggregation. Translating from fancy words into English: you can have some intermediate "BGP hubs" that consolidate the BGP peerings from your NVAs before handing the routes to the route server, as this diagram shows:

However, there is a problem with this design: the BGP adjacencies between the "BGP hubs" and ARS are eBGP, because their Autonomous System Numbers (ASNs) are different (you can check https://www.bgp.us/ibgp-and-ebgp/ for more details on the differences between iBGP and eBGP). As a consequence, the default behavior of the BGP hubs is to set themselves as next hop in the routes they send over to ARS. This usually makes sense; however, in this case we want the routes in NIC1 to point to the final NVA (NVA0, NVA1, etc.), not to the BGP hubs.
Is it possible to override the default eBGP behavior and preserve the original next hop that each NVA advertised to the BGP hubs? The answer is yes, as is often the case with a protocol as flexible as BGP. A command exists in many mature BGP implementations that does exactly that: in Cisco it is called "next hop unchanged", in Juniper "no-nexthop-change" (or "no-nexthop-self"), and in BIRD (which I personally use) it is called "next hop keep". For other NVAs, I would suggest checking with your vendor.
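In BIRD 1.6, for example, the option goes into the ARS-facing BGP protocol on the hub. A minimal sketch, assuming the hub uses ASN 65000 and peers with the ARS instance at 10.13.76.4 (ARS always uses ASN 65515); the import/export policies here are illustrative:

```
# Hypothetical bird.conf fragment on a BGP hub (BIRD 1.6)
protocol bgp rs0 {
    local as 65000;
    neighbor 10.13.76.4 as 65515;  # ARS instance; ARS always speaks ASN 65515
    multihop 2;                    # ARS is not directly connected
    next hop keep;                 # preserve the NVAs' next hops instead of next-hop-self
    import none;                   # illustrative: don't take routes back from ARS
    export all;                    # re-advertise what was learned from the NVAs
}
```

With "next hop keep", the hub leaves the NEXT_HOP attribute untouched when re-advertising the NVAs' routes, which is exactly what we need ARS to see.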
Does it work?
Like a charm! Of course I needed to test this. I configured two BGP hubs for redundancy (you can certainly have more, since this is a critical service). If you are using Linux VMs as I do (Ubuntu 22.04 with the "bird" package installed), you could even use the VMSS-based setup with health probes and automatic self-healing that I described in the Azure Firewall's sidekick to join the BGP superheroes post. You can see that the BGP hubs are BGP-peered to both Azure Route Server instances and to each of the ten NVAs:
jose@bgppeer:~$ sudo birdc 'show prot'
BIRD 1.6.8 ready.
name     proto    table    state  since     info
device1  Device   master   up     15:15:00
direct1  Direct   master   down   15:15:00
kernel1  Kernel   master   down   15:15:00
static1  Static   master   up     15:15:00
nva0     BGP      master   up     15:15:05  Established
nva1     BGP      master   up     15:15:05  Established
nva2     BGP      master   up     15:17:31  Established
nva3     BGP      master   up     15:24:36  Established
nva4     BGP      master   up     15:27:39  Established
nva5     BGP      master   up     15:34:30  Established
nva6     BGP      master   up     15:42:49  Established
nva7     BGP      master   up     15:44:54  Established
nva8     BGP      master   up     15:47:32  Established
nva9     BGP      master   up     15:49:44  Established
rs0      BGP      master   up     15:15:05  Established
rs1      BGP      master   up     15:15:05  Established
If we have a look at the Azure Route Server, it is only peered to the two BGP hubs:
❯ az network routeserver peering list --routeserver $ars_name -g $rg -o table
Name    PeerAsn    PeerIp       ProvisioningState    ResourceGroup
------  ---------  -----------  -------------------  ---------------
peer1   65000      10.13.76.84  Succeeded            routeserver
peer2   65000      10.13.76.85  Succeeded            routeserver
Notice how the BGP hubs have the IP addresses 10.13.76.84 and 10.13.76.85. However, if we look at the learned routes, we can see that the next hop is the actual NVA’s IP address (10.13.76.100 through 10.13.76.109):
❯ az network routeserver peering list-learned-routes --routeserver $ars_name -g $rg --query 'RouteServiceRole_IN_0' -o table -n peer1
AsPath       LocalAddress    Network         NextHop       Origin    SourcePeer    Weight
-----------  --------------  --------------  ------------  --------  ------------  --------
65000-65100  10.13.76.4      100.0.0.100/32  10.13.76.100  EBgp      10.13.76.84   32768
65000-65101  10.13.76.4      100.0.0.101/32  10.13.76.101  EBgp      10.13.76.84   32768
65000-65102  10.13.76.4      100.0.0.102/32  10.13.76.102  EBgp      10.13.76.84   32768
65000-65103  10.13.76.4      100.0.0.103/32  10.13.76.103  EBgp      10.13.76.84   32768
65000-65104  10.13.76.4      100.0.0.104/32  10.13.76.104  EBgp      10.13.76.84   32768
65000-65105  10.13.76.4      100.0.0.105/32  10.13.76.105  EBgp      10.13.76.84   32768
65000-65106  10.13.76.4      100.0.0.106/32  10.13.76.106  EBgp      10.13.76.84   32768
65000-65107  10.13.76.4      100.0.0.107/32  10.13.76.107  EBgp      10.13.76.84   32768
65000-65108  10.13.76.4      100.0.0.108/32  10.13.76.108  EBgp      10.13.76.84   32768
65000-65109  10.13.76.4      100.0.0.109/32  10.13.76.109  EBgp      10.13.76.84   32768
And sure enough, inspecting the effective routes in any virtual machine in the virtual network will show the correct routes being installed:
❯ az network nic show-effective-route-table -n vmVMNic -g $rg -o table
Source                 State    Address Prefix    Next Hop Type          Next Hop IP
---------------------  -------  ----------------  ---------------------  -------------
Default                Active   10.13.76.0/24     VnetLocal
VirtualNetworkGateway  Active   100.0.0.101/32    VirtualNetworkGateway  10.13.76.101
VirtualNetworkGateway  Active   100.0.0.100/32    VirtualNetworkGateway  10.13.76.100
VirtualNetworkGateway  Active   100.0.0.102/32    VirtualNetworkGateway  10.13.76.102
VirtualNetworkGateway  Active   100.0.0.108/32    VirtualNetworkGateway  10.13.76.108
VirtualNetworkGateway  Active   100.0.0.103/32    VirtualNetworkGateway  10.13.76.103
VirtualNetworkGateway  Active   100.0.0.107/32    VirtualNetworkGateway  10.13.76.107
VirtualNetworkGateway  Active   100.0.0.104/32    VirtualNetworkGateway  10.13.76.104
VirtualNetworkGateway  Active   100.0.0.106/32    VirtualNetworkGateway  10.13.76.106
VirtualNetworkGateway  Active   100.0.0.105/32    VirtualNetworkGateway  10.13.76.105
VirtualNetworkGateway  Active   100.0.0.109/32    VirtualNetworkGateway  10.13.76.109
Default                Active   0.0.0.0/0         Internet
By the way, don't let the source VirtualNetworkGateway confuse you. The mechanism with which Route Server "injects" its routes in the NICs is the same one used by ExpressRoute and VPN gateways (and even Virtual WAN), so the same source type appears for these dynamically programmed routes.
The last thing is testing the data plane from the VM: give me one ping, Vasili! Although here I am testing with ten pings per NVA, not just one. The following output only shows the first four, all with no packet loss, but believe me, all of them work fine (Jedi hand move):
jose@vm:~$ for i in $(seq 0 9); do ping -q 100.0.0.10$i -c 10; done
PING 100.0.0.100 (100.0.0.100) 56(84) bytes of data.

--- 100.0.0.100 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 1.032/1.492/3.338/0.632 ms
PING 100.0.0.101 (100.0.0.101) 56(84) bytes of data.

--- 100.0.0.101 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 1.054/1.300/1.782/0.213 ms
PING 100.0.0.102 (100.0.0.102) 56(84) bytes of data.

--- 100.0.0.102 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9026ms
rtt min/avg/max/mdev = 0.822/1.273/1.620/0.258 ms
PING 100.0.0.103 (100.0.0.103) 56(84) bytes of data.

--- 100.0.0.103 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 1.043/1.607/2.417/0.478 ms
...
Reusing NVAs
OK, you are not too fond of maintaining Linux and BIRD yourself, I hear you say. But you happen to have some appliances in your VNet that support BGP and already peer with Azure Route Server. Could you repurpose those as “BGP hubs” or “BGP aggregators” for other NVAs?
Yes, but you would have to be careful here: these NVAs should set themselves as next hop for the prefixes they are responsible for, but preserve the original next hop for prefixes they get from other NVAs in the VNet. For example, consider the expansion of our previous example where now the BGP hubs are also NVAs that need to inject certain prefixes on their own:

In this case, the BGP hubs should preserve the next hop in the updates that come from the NVAs on the right (the blue prefixes 100.0.0.100-109, with ASNs 65100 through 65109), but they should set themselves as next hop for the prefixes they are responsible for (the green prefix 100.0.0.200 in the diagram above). This means that the configuration cannot be applied at the neighbor level; instead, it needs a route map. For example, a possible configuration for a Cisco device might look like this:
router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.13.76.4 remote-as 65515
 neighbor 10.13.76.4 ebgp-multihop 2
 neighbor 10.13.76.5 remote-as 65515
 neighbor 10.13.76.5 ebgp-multihop 2
 !
 address-family ipv4
  neighbor 10.13.76.4 activate
  neighbor 10.13.76.4 route-map ToARS out
  neighbor 10.13.76.5 activate
  neighbor 10.13.76.5 route-map ToARS out
 exit-address-family
!
route-map ToARS permit 10
 match as-path 1
 set ip next-hop unchanged
route-map ToARS permit 20
!
ip as-path access-list 1 permit ^6510[0-9]
The previous configuration does the following:
- Peer with each of the two ARS instances (10.13.76.4 and 10.13.76.5, ASN 65515).
- For each of the two ARS instances apply the route-map ToARS in the outgoing direction.
- For prefixes matching the AS-path list 1 (which matches everything coming from ASNs 65100-65109), preserve the original next-hop.
- For everything else, use the default behavior (which for eBGP is next-hop self).
Disclaimer: I haven't had the time to actually test the configuration above; the main goal is to show the concept: you do next-hop self for some routes and preserve the next hop for others. Of course, for other NVA vendors the required configuration will be different, or it might not even be possible (for example, I didn't find out how to do this selectively with my beloved BIRD 1.6).
This way you don’t need to add extra appliances to your design, and your existing NVAs can take over the role of “BGP hub” or “BGP aggregator”.
Conclusion
If your design happens to need more BGP neighbors than Azure Route Server supports, worry not: there is a workaround. You need to verify whether your NVAs support a "next hop keep" sort of functionality, and you should be good to go. However, before going this way, my suggestion would be to think hard about whether your design is overly complex, since the ARS limit of 8 peers is usually enough for most architectures.
Did I forget to mention anything? Do you disagree on any point? Please let me know in the comments below!

Hi Jose, I've always appreciated your insightful and inspiring blog posts; such a pleasure to see a new update after a while! Your latest piece truly made my day.
I had a quick technical question: if iBGP is established between the ARS and the BGP hub, and eBGP is configured between the BGP hub and the NVAs, wouldn't that naturally resolve the next-hop issue? From what I understand, iBGP peers retain the original next hop when advertising routes, so the BGP hub would preserve the original next hop when advertising to the ARS. Please feel free to correct me if I'm off base; apologies if this is a basic question.
Wishing you a great day, and looking forward to your next post as always!
Hey Jaeok, thanks for your kind words! In theory you are right, but I always thought that ARS only supports eBGP. I have never tried configuring an iBGP peer; not sure if it even works.