Route Server Multi-Region Design

In my previous blog post I shared my view on the characteristics of the new Azure Route Server that I am most excited about. In this one I would like to give you a glimpse of how it works in a design that I see in many organizations: a multi-region setup, with Network Virtual Appliances acting as firewalls or VPN devices in each region, deployed with active/passive redundancy. Here is the testbed that I will be using:

Test Topology

As you can see in the diagram, we will need some BGP functionality in our NVAs. For that purpose I am using Linux VMs running bird to simulate those NVAs. And before you ask: you can download a full deployment script from my GitHub repository here.

Spoke-to-Spoke, Same Region

As you can see, we have 2 hub VNets, each in a different region, and 2 spokes connected to each hub. Let us focus on one of them to start with. Traditionally, you would have to configure User-Defined Routes (UDRs) in each spoke to reach the rest of the world. For example, for spoke11 to reach spoke12, it needs a route for spoke12's prefix pointing to the NVA.
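
With classic UDR-based routing that would mean something like this (a sketch; the route table, VNet and subnet names are hypothetical, 10.1.17.0/24 is spoke12's prefix and 10.1.1.4 is the NVA in hub1):

az network route-table create -n spoke11-rt -g $rg
az network route-table route create -n to-spoke12 -g $rg --route-table-name spoke11-rt \
   --address-prefix 10.1.17.0/24 --next-hop-type VirtualAppliance --next-hop-ip-address 10.1.1.4
az network vnet subnet update -g $rg --vnet-name spoke11 -n vm --route-table spoke11-rt

And you would need similar routes in every spoke, for every destination.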

But with Azure Route Server we do not need to add those UDRs any more. We have one Route Server in hub1, and it has established a BGP adjacency with the NVA:

az network routeserver peering list --vrouter-name $hub1_rs_name -g $rg -o table
Name     PeerAsn    PeerIp    ProvisioningState    ResourceGroup
-------  ---------  --------  -------------------  ---------------
hub1nva  65001      10.1.1.4  Succeeded            routeserver
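
For reference, the adjacency above can be created with something along these lines (a sketch; the exact parameter name for the Route Server has changed across CLI versions):

az network routeserver peering create -n hub1nva -g $rg \
   --vrouter-name $hub1_rs_name --peer-ip 10.1.1.4 --peer-asn 65001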

Please ignore the “vrouter” name in the command argument instead of “routeserver”; this will be fixed soon. As you can see, the Route Server is talking BGP with the NVA at 10.1.1.4, which uses ASN 65001. Let’s see what routes it is receiving:

az network routeserver peering list-learned-routes -n hub1nva \
   --vrouter-name $hub1_rs_name -g $rg --query 'RouteServiceRole_IN_0' -o table
LocalAddress    Network      NextHop    SourcePeer    Origin    AsPath      
--------------  -----------  ---------  ------------  --------  -----------
10.1.0.4        10.1.0.0/16  10.1.1.4   10.1.1.4      EBgp      65001
10.1.0.4        10.2.0.0/16  10.1.1.4   10.1.1.4      EBgp      65001-65002 

Interesting: it is receiving two prefixes. The first one (10.1.0.0/16) is originated by the neighbor NVA, since the AS path consists of a single ASN, 65001. The second one (10.2.0.0/16) is generated by the NVA in hub2, as you can tell from its AS path. 10.1.0.0/16 and 10.2.0.0/16 are summaries that aggregate all the address space in each region. The Route Server learns these routes and programs them as effective routes on all NICs in its own VNet, as well as in directly peered VNets. Let’s check the effective routes in spoke11, for example:

az network nic show-effective-route-table --ids $spoke11_vm_nic_id -o table
Source                 State    Address Prefix   Next Hop Type          Next Hop
---------------------  -------  ---------------- ---------------------  --------
Default                Active   10.1.16.0/24     VnetLocal
Default                Active   10.1.0.0/20      VNetPeering
VirtualNetworkGateway  Active   10.2.0.0/16      VirtualNetworkGateway  10.1.1.4
VirtualNetworkGateway  Active   10.1.0.0/16      VirtualNetworkGateway  10.1.1.4
Default                Active   0.0.0.0/0        Internet
Default                Active   10.0.0.0/8       None
Default                Active   100.64.0.0/10    None
Default                Active   192.168.0.0/16   None
Default                Active   25.33.80.0/20    None
Default                Active   25.41.3.0/25     None

You can see that the two summaries appear in the effective routes, both with the NVA in hub1 as next hop. For traffic between spokes in the same region we are interested in 10.1.0.0/16: both spokes know that in order to reach each other they need to send traffic to the NVA, which will forward the packets.
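
By the way, you can also check the opposite direction, i.e. which prefixes the Route Server is advertising to the NVA, with the list-advertised-routes counterpart of the earlier command. It should show the hub and spoke prefixes that the Route Server already knows from its own VNet and its peerings:

az network routeserver peering list-advertised-routes -n hub1nva \
   --vrouter-name $hub1_rs_name -g $rg -o table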

Why Summary Routes?

You might be wondering why the NVA advertises the 10.1.0.0/16 summary route instead of the exact spoke prefixes.

There are two reasons why it is recommended to advertise summaries from the NVA. Firstly, the Route Server will not learn routes that it already knows, and it already knows the spoke prefixes (10.1.16.0/24 and 10.1.17.0/24 in this example); if you advertise exactly those prefixes, they will not be injected as effective routes.

Secondly, your Route Server might propagate these prefixes somewhere else, such as to ExpressRoute. ExpressRoute gateways have limits on the number of prefixes they can advertise from Azure to on-premises devices, so being frugal in this area might save you some trouble in the future.
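
In bird, one way of advertising only the summary is to define it as a static route and restrict the export filter to static routes. Here is a minimal sketch; the protocol names and the filter are illustrative, not the exact configuration from my deployment script:

cat <<'EOF' | sudo tee -a /etc/bird/bird.conf
# Originate the regional summary as a static route (next hop is the subnet's default gateway)
protocol static summaries {
    route 10.1.0.0/16 via 10.1.1.1;
}
# eBGP session to the first Route Server instance, exporting only the summary
protocol bgp rs0 {
    local as 65001;
    neighbor 10.1.0.4 as 65515;
    multihop;
    import all;
    export where proto = "summaries";
}
EOF
sudo birdc configure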

Inter-Region Traffic

We have seen that the spokes in region 1 learn the 10.2.0.0/16 summary, so it would be expected that the spokes in region 2 learn the 10.1.0.0/16 prefix. Let’s verify it:

az network nic show-effective-route-table --ids $spoke21_vm_nic_id -o table
Source                 State    Address Prefix   Next Hop Type          Next Hop
---------------------  -------  ---------------- ---------------------  --------
Default                Active   10.2.16.0/24     VnetLocal
Default                Active   10.2.0.0/20      VNetPeering
VirtualNetworkGateway  Active   10.2.0.0/16      VirtualNetworkGateway  10.2.1.4
VirtualNetworkGateway  Active   10.1.0.0/16      VirtualNetworkGateway  10.2.1.4
Default                Active   0.0.0.0/0        Internet
Default                Active   10.0.0.0/8       None
Default                Active   100.64.0.0/10    None
Default                Active   192.168.0.0/16   None
Default                Active   25.33.80.0/20    None
Default                Active   25.41.3.0/25     None

Fantastic! Spokes in region 1 know that to talk to region 2 they need to go through their NVA, and the same happens in region 2. Done deal, right? Not so fast, we have a small problem. Let’s look at the effective routes in the NIC of NVA1 in hub1:

az network nic show-effective-route-table --ids $hub1_nva_nic_id -o table
Source                 State    Address Prefix   Next Hop Type          Next Hop
---------------------  -------  ---------------- ---------------------  --------
Default                Active   10.1.0.0/20      VnetLocal
Default                Active   10.1.16.0/24     VNetPeering
Default                Active   10.1.17.0/24     VNetPeering
VirtualNetworkGateway  Active   10.2.0.0/16      VirtualNetworkGateway  10.1.1.4
VirtualNetworkGateway  Active   10.1.0.0/16      VirtualNetworkGateway  10.1.1.4
Default                Active   0.0.0.0/0        Internet
Default                Active   10.0.0.0/8       None
Default                Active   100.64.0.0/10    None
Default                Active   192.168.0.0/16   None
Default                Active   25.33.80.0/20    None
Default                Active   25.41.3.0/25     None
Default                Active   10.2.0.0/20      VNetGlobalPeering

As you would expect, the same routes appear here as for the spokes; the Route Server does not distinguish between hub and spokes, the same routes are programmed everywhere. Why is this a problem? Because when NVA1 in hub1 tries to reach NVA2 in hub2, the packet will hit Azure’s network addressed to some 10.2.x.x address. The 10.2.0.0/16 route injected into NVA1’s own NIC will then kick in, and instead of forwarding the packet to NVA2 in hub2, Azure will send it right back to NVA1. Nice routing loop we have here.

There are two possible solutions I can think of. First, you could override the 10.2.0.0/16 route with a UDR pointing to NVA2 (instead of NVA1), but this configuration would be static and would not react to problems in the NVA (read further down for NVA redundancy).
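
For completeness, such an override could look something like this (a sketch; the route table name is hypothetical and it would be associated with the NVA subnet in hub1; 10.2.1.4 is NVA2 in hub2):

az network route-table route create -n override-hub2 -g $rg --route-table-name hub1-nva-rt \
   --address-prefix 10.2.0.0/16 --next-hop-type VirtualAppliance --next-hop-ip-address 10.2.1.4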

The second solution is to prevent the packet from hitting the Azure network at all, by encapsulating it in a tunnel. This is the approach I have taken: NVA1 and NVA2 talk BGP over a VXLAN tunnel (GRE is not supported in Azure, in case you were thinking in that direction), so the Azure network only sees packets going between the NVAs themselves, and no other source or destination IP address is visible to Azure.
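
If you are curious about the plumbing, a VXLAN interface can be created on NVA1 with something along these lines (a sketch; the VNI, UDP port and tunnel addressing are illustrative, 10.1.1.4 is NVA1 and 10.2.1.4 is NVA2):

# VXLAN tunnel from NVA1 to NVA2; BGP then runs over the 192.168.0.0/30 tunnel addresses
sudo ip link add vxlan0 type vxlan id 100 local 10.1.1.4 remote 10.2.1.4 dstport 4789
sudo ip addr add 192.168.0.1/30 dev vxlan0
sudo ip link set vxlan0 up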

Let’s have a look at the routing table of NVA1:

ssh  $hub1_nva_pip_ip "sudo birdc show route"
BIRD 1.6.3 ready.
10.2.0.0/16        via 192.168.0.2 on vxlan0 [hub2a 17:24:50] * (100/0) [AS65002i]
10.2.0.0/20        via 192.168.0.2 on vxlan0 [hub2a 17:24:51] * (100/0) [AS65515i]
10.1.0.0/16        via 10.1.1.1 on eth0 [static1 17:24:48] * (200)
10.1.0.0/20        via 10.1.1.1 on eth0 [rs0 17:24:49 from 10.1.0.4] * (100/?) [AS65515i]
                   via 10.1.1.1 on eth0 [rs1 17:24:50 from 10.1.0.5] (100/?) [AS65515i]
10.2.16.0/24       via 192.168.0.2 on vxlan0 [hub2a 17:24:51] * (100/0) [AS65515i]
10.2.0.4/32        via 192.168.0.2 on vxlan0 [hub2a 17:24:50] * (100/0) [AS65002i]
10.2.17.0/24       via 192.168.0.2 on vxlan0 [hub2a 17:24:51] * (100/0) [AS65515i]
10.2.0.5/32        via 192.168.0.2 on vxlan0 [hub2a 17:24:50] * (100/0) [AS65002i]
10.1.0.5/32        via 10.1.1.1 on eth0 [static1 17:24:48] * (200)
10.1.16.0/24       via 10.1.1.1 on eth0 [rs0 17:24:49 from 10.1.0.4] * (100/?) [AS65515i]
                   via 10.1.1.1 on eth0 [rs1 17:24:50 from 10.1.0.5] (100/?) [AS65515i]
10.1.0.4/32        via 10.1.1.1 on eth0 [static1 17:24:48] * (200)
10.1.17.0/24       via 10.1.1.1 on eth0 [rs0 17:24:49 from 10.1.0.4] * (100/?) [AS65515i]
                   via 10.1.1.1 on eth0 [rs1 17:24:50 from 10.1.0.5] (100/?) [AS65515i]
192.168.0.1/32     via 192.168.0.2 on vxlan0 [hub2a 17:24:50] * (100/0) [AS65002i]
192.168.0.2/32     dev vxlan0 [static1 17:24:48] * (200)
192.168.0.6/32     dev vxlan1 [static1 17:24:48] * (200) 

There is a lot going on here, but for now just focus on the egress interface of the routes going to 10.2.x.x, which is a VXLAN interface. Routes for 10.1.x.x, on the other hand, point to the good old eth0 interface.

At this point, we now have inter-hub communication. Yay!

Where are the /24 routes?

In the previous command output you might have spotted the spoke21 and spoke22 prefixes in the route table of NVA1 (10.2.16.0/24 and 10.2.17.0/24). Here are the two routes again for completeness:

10.2.16.0/24       via 192.168.0.2 on vxlan0 [hub2a 17:24:51] * (100/0) [AS65515i]
10.2.17.0/24       via 192.168.0.2 on vxlan0 [hub2a 17:24:51] * (100/0) [AS65515i]

However, if NVA1 is learning those prefixes from NVA2, why is the Route Server in hub1 not injecting them as effective routes? (If you remember, the effective routes only contained the /16 summaries.)

The reason is the Autonomous System Number of the Route Servers, which is always 65515 and not configurable. The /24 prefixes were originally injected into BGP by the Route Server in hub2, so they carry 65515 in their AS path; when the Route Server in hub1 receives them and finds its own ASN in the path, it throws them away following standard BGP loop prevention rules.
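
You can check this from NVA1’s point of view: the /24 routes learned over the tunnel should show an AS path of 65002 65515, and after NVA1 prepends its own 65001 and re-advertises them, the Route Server in hub1 sees 65515 in the path and rejects them:

ssh $hub1_nva_pip_ip "sudo birdc show route all 10.2.16.0/24"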

High Availability

The Azure Route Server is deployed as a highly available pair of instances. If you have a look at a Route Server, it shows two IP addresses with which your NVAs need to peer:

az network routeserver show -n $hub1_rs_name -g $rg --query virtualRouterIps -o tsv
10.1.0.4
10.1.0.5
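
On the NVA side this simply means one BGP session per Route Server instance. In bird, the second session could look like the following sketch (mirroring the earlier rs0 example; names are illustrative):

cat <<'EOF' | sudo tee -a /etc/bird/bird.conf
# Second eBGP session, towards the other Route Server instance
protocol bgp rs1 {
    local as 65001;
    neighbor 10.1.0.5 as 65515;
    multihop;
    import all;
    export where proto = "summaries";
}
EOF
sudo birdc configure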

And if we have a look at the adjacencies of NVA1, we will see both instances:

ssh -n -o BatchMode=yes -o StrictHostKeyChecking=no $hub1_nva_pip_ip "sudo birdc show protocols"
BIRD 1.6.3 ready.
name     proto    table    state  since       info
device1  Device   master   up     17:24:49    
direct1  Direct   master   down   17:24:49    
kernel1  Kernel   master   up     17:24:49    
static1  Static   master   up     17:24:49    
rs0      BGP      master   up     17:24:50    Established   
rs1      BGP      master   up     17:24:51    Established   
hub2a    BGP      master   up     17:24:51    Established   
hub2b    BGP      master   up     23:57:48    Established

But we see something else: as the diagram in this article shows, there are two NVAs in hub2 to test redundancy in an active/standby fashion. The standby NVA advertises its routes with a longer AS path, which it achieves by prepending its own ASN on export.
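
In bird, that prepending could be done with an export filter; here is a minimal sketch for the standby NVA in hub2 (names are illustrative, not the exact configuration from my script):

cat <<'EOF' | sudo tee -a /etc/bird/bird.conf
# Standby NVA: prepend the local ASN on export, so its routes always lose the
# best-path comparison against the active NVA (shorter AS path wins)
protocol bgp rs0 {
    local as 65002;
    neighbor 10.2.0.4 as 65515;
    multihop;
    import all;
    export filter {
        bgp_path.prepend(65002);
        accept;
    };
}
EOF
sudo birdc configure

Let’s inspect one of the routes in NVA1: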

ssh $hub1_nva_pip_ip "sudo birdc show route all 10.2.0.0/16"
BIRD 1.6.3 ready.
10.2.0.0/16     via 192.168.0.2 on vxlan0 [hub2a 17:24:50] * (100/0) [AS65002i]
        Type: BGP unicast univ
        BGP.origin: IGP
        BGP.as_path: 65002
        BGP.next_hop: 192.168.0.2
        BGP.local_pref: 100
                via 192.168.0.6 on vxlan1 [hub2b 23:57:47] (100/0) [AS65002i]
        Type: BGP unicast univ
        BGP.origin: IGP
        BGP.as_path: 65002 65002
        BGP.next_hop: 192.168.0.6
        BGP.local_pref: 100

As you can see, we have two routes, and the one from the secondary appliance has a longer AS path. The same happens on the other side: the Route Server is learning the same routes from both appliances:

az network routeserver peering list-learned-routes -n hub2nva --vrouter-name $hub2_rs_name -g $rg --query 'RouteServiceRole_IN_0' -o table
LocalAddress    Network      NextHop    SourcePeer    Origin    AsPath       Weight
--------------  -----------  ---------  ------------  --------  -----------  --------
10.2.0.4        10.2.0.0/16  10.2.1.4   10.2.1.4      EBgp      65002        32768
10.2.0.4        10.1.0.0/16  10.2.1.4   10.2.1.4      EBgp      65002-65001  32768
az network routeserver peering list-learned-routes -n hub2nva2 --vrouter-name $hub2_rs_name -g $rg --query 'RouteServiceRole_IN_0' -o table
LocalAddress    Network      NextHop    SourcePeer    Origin    AsPath             Weight
--------------  -----------  ---------  ------------  --------  -----------------  --------
10.2.0.4        10.2.0.0/16  10.2.1.5   10.2.1.5      EBgp      65002-65002        32768
10.2.0.4        10.1.0.0/16  10.2.1.5   10.2.1.5      EBgp      65002-65002-65001  32768

But it only programs the routes with the shortest AS path into the effective routes of the NICs:

az network nic show-effective-route-table --ids $spoke21_vm_nic_id -o table
Source                 State    Address Prefix    Next Hop Type          Next Hop IP
---------------------  -------  ----------------  ---------------------  -------------
Default                Active   10.2.16.0/24      VnetLocal
Default                Active   10.2.0.0/20       VNetPeering
VirtualNetworkGateway  Active   10.2.0.0/16       VirtualNetworkGateway  10.2.1.4
VirtualNetworkGateway  Active   10.1.0.0/16       VirtualNetworkGateway  10.2.1.4
Default                Active   0.0.0.0/0         Internet
Default                Active   10.0.0.0/8        None
Default                Active   100.64.0.0/10     None
Default                Active   192.168.0.0/16    None
Default                Active   25.33.80.0/20     None
Default                Active   25.41.3.0/25      None

So what happens if we have an outage on the primary NVA? Let’s shut it down and see. Before doing so, I started a continuous ping from spoke11 to spoke21: not a single ping was lost, the removal of the route was extremely fast. In case you want to reproduce the test, taking the active NVA down can be as simple as this (the variable name is hypothetical):
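
# Simulate an outage by deallocating the active NVA VM in hub2
az vm deallocate -g $rg -n $hub2_nva_vm_name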

We can now have a look at the effective routes again. As expected, they are now pointing to the secondary NVA, at 10.2.1.5:

az network nic show-effective-route-table --ids $spoke21_vm_nic_id -o table
Source                 State    Address Prefix    Next Hop Type          Next Hop IP
---------------------  -------  ----------------  ---------------------  -------------
Default                Active   10.2.16.0/24      VnetLocal
Default                Active   10.2.0.0/20       VNetPeering
VirtualNetworkGateway  Active   10.2.0.0/16       VirtualNetworkGateway  10.2.1.5
VirtualNetworkGateway  Active   10.1.0.0/16       VirtualNetworkGateway  10.2.1.5
Default                Active   0.0.0.0/0         Internet
Default                Active   10.0.0.0/8        None
Default                Active   100.64.0.0/10     None
Default                Active   192.168.0.0/16    None
Default                Active   25.33.80.0/20     None
Default                Active   25.41.3.0/25      None

If we looked at the routes in NVA1, we would see that they have converged to the secondary NVA as well (at least if the VXLAN interfaces that my script creates were reboot-persistent).

Conclusion

We have deployed a hub and spoke environment without having to create route tables in the spokes, which is very attractive for organizations where the spoke admins do not necessarily have networking expertise. Additionally, we have created a dynamic topology across multiple regions with redundancy, where BGP makes sure that the paths are selected optimally.

And we have done all of the above using a Linux-based virtual appliance, meaning that you can use this design with any NVA vendor that supports BGP (with a little help from its VXLAN friends).
