An old Cherokee proverb says: “Listen to the whispers and you won’t have to hear the screams”. Routing problems are hard. They are hard to uncover, because sometimes they do not become apparent until something happens: for example, your backup routes disappear, and you only notice when the primary routes are gone too. And they are hard in the way they hit you, because when routing fails, everything else does too. But routing problems can be avoided, if you listen carefully.
I finally had time during my Easter vacation to look into a project I have had in the back of my mind for quite some time now: how to analyze the routes learnt by Azure virtual networks, to make sure that routing works as it should? I used a Network Virtual Appliance to send BGP events to Azure Monitor, so that routing messages can be analyzed to identify routing issues before they do any damage.
First of all, this is not a new idea. Route Analytics was already a thing long before public cloud. According to Wikipedia, Route Analytics “is an emerging network monitoring technology specifically developed to analyze the routing protocols and structures in meshed IP Networks. Their main mode of operation is to passively listen to the Layer 3 routing protocol exchanges between routers for the purposes of network discovery, mapping, real-time monitoring and routing diagnostics.”
Wikipedia describes it as an “emerging” technology, but that article was written a while back. Many moons ago I worked with route analytics software such as Packet Design RAMS, and I loved it. That product is no more, though. There are other, more modern approaches such as batfish. There are probably quite a few other old and new school approaches out there, but I was not able to find anything that fulfilled my requirements:
- Monitor Virtual Networks added to an Azure hub and spoke topology
- Identify routes injected via ExpressRoute or site-to-site VPN
- Monitor BGP Autonomous System paths in use
- Identify potential problems such as overlapping prefixes
- Visually identify network events
Hence, I decided to reinvent the wheel. If you have an idea about a simpler way to do this, happy to learn!
BGP Analytics Azure Setup
A possible first approach might be getting different route tables (for example from the ExpressRoute or VPN gateways) at regular intervals and logging the results into a log management system. However, this regular polling has the problem of potentially missing events that happen between polling intervals, such as route or adjacency flaps.
Consequently, the desired architecture should follow a push, rather than a pull approach, and originate messages whenever some event at the BGP level happens. The new Azure Route Server offers an API into the routing processes running inside of a Virtual Network, so it looks like a good start.
My first idea was enabling Diagnostics Settings in the Azure Route Server to capture BGP events. Unfortunately, “The resource type ‘microsoft.network/virtualhubs’ does not support diagnostic settings”. And even if it were available, chances are that the logs would be too superficial, but I might be falling into my pessimistic mode here.
Hence, the approach I decided to try is getting all the routes from the Azure Route Server in a Network Virtual Appliance (NVA), and configuring the BGP software on the NVA to push BGP events to Azure Monitor. If you want to build a similar setup in Azure, I put a CLI script that will deploy all components here. More or less, this:
As the NVA I use a Linux appliance with the BIRD routing software. BIRD supports logging received BGP messages in the Multi-Threaded Routing Toolkit (MRT) format, documented in RFC 6396. BIRD will log all received messages into a file that can be interpreted with tooling such as the mrtparse Python module.
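To get a feeling for what is inside such a file, here is a minimal sketch that walks the MRT common header defined in RFC 6396 (timestamp, type, subtype, length) using only the Python standard library. It is not how mrtparse works internally, and it does not decode the BGP payload; the names `iter_mrt_records` and `MrtRecord` are made up for this illustration.

```python
import struct
from typing import Iterator, NamedTuple

# MRT common header (RFC 6396): 4-byte timestamp, 2-byte type,
# 2-byte subtype, 4-byte length, all in network byte order.
MRT_HEADER = struct.Struct("!IHHI")

BGP4MP = 16  # MRT type used for BGP messages and state changes
BGP4MP_SUBTYPES = {
    0: "STATE_CHANGE",
    1: "MESSAGE",
    4: "MESSAGE_AS4",
    5: "STATE_CHANGE_AS4",
}


class MrtRecord(NamedTuple):
    timestamp: int
    mrt_type: int
    subtype: int
    body: bytes  # undecoded payload, e.g. a BGP4MP message


def iter_mrt_records(data: bytes) -> Iterator[MrtRecord]:
    """Yield every complete MRT record found in a byte string."""
    offset = 0
    while offset + MRT_HEADER.size <= len(data):
        timestamp, mrt_type, subtype, length = MRT_HEADER.unpack_from(data, offset)
        offset += MRT_HEADER.size
        body = data[offset:offset + length]
        if len(body) < length:
            break  # truncated record, for example one still being written
        offset += length
        yield MrtRecord(timestamp, mrt_type, subtype, body)
```

Reading the dump file with `open(path, "rb").read()` and passing the bytes to `iter_mrt_records` would list the record types that BIRD logged. For the actual BGP attributes (prefixes, AS paths, next hops) a full decoder such as mrtparse is still needed.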
After decoding the messages, they can be sent to Azure Monitor using the HTTP Data Collection API (in public preview at the time of this writing). The script reading the MRT messages and sending them to Azure Monitor can be run at one-minute intervals with crontab. The script I used can be found in the GitHub repository complementing this article here.
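For reference, a request to the HTTP Data Collection API is a signed POST of a JSON array. The sketch below builds the URL, headers and body following the shared key scheme described in the API documentation, without sending anything. It is not taken from the script in the repository: the function names are hypothetical, and the workspace ID, key and log type are placeholders you would supply.

```python
import base64
import hashlib
import hmac
import json
from email.utils import formatdate


def build_signature(workspace_id: str, shared_key: str, date: str, content_length: int) -> str:
    """Build the SharedKey Authorization header value for a POST to /api/logs."""
    string_to_sign = (
        f"POST\n{content_length}\napplication/json\nx-ms-date:{date}\n/api/logs"
    )
    decoded_key = base64.b64decode(shared_key)
    digest = hmac.new(decoded_key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode('ascii')}"


def build_request(workspace_id: str, shared_key: str, log_type: str, records: list):
    """Return (url, headers, body) for one batch of log records."""
    body = json.dumps(records).encode("utf-8")
    date = formatdate(usegmt=True)  # RFC 1123 style date ending in "GMT"
    url = f"https://{workspace_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01"
    headers = {
        "Content-Type": "application/json",
        "Authorization": build_signature(workspace_id, shared_key, date, len(body)),
        "Log-Type": log_type,  # becomes the custom table name, with a _CL suffix
        "x-ms-date": date,
    }
    return url, headers, body
```

The returned triple can be handed to any HTTP client, for example `requests.post(url, data=body, headers=headers)`. The sketch assumes the workspace key has already been obtained by some secure means.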
The script does some formatting before sending the messages to Azure Monitor, such as flattening the JSON structure. Besides, it will get the required credentials with a user managed identity, so that no secret needs to be coded anywhere.
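Flattening matters because nested JSON is awkward to query once it lands in a Log Analytics table. As a rough idea of what such a step can look like (the script in the repository may do it differently, and the field names in the usage example are invented), a recursive flattener could be:

```python
def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Collapse nested dictionaries and lists into a single-level dictionary."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            if all(not isinstance(item, (dict, list)) for item in value):
                # Lists of plain values (such as an AS path) become one string
                flat[name] = " ".join(str(item) for item in value)
            else:
                # Lists of nested objects get an index appended to the key
                for index, item in enumerate(value):
                    flat.update(flatten({str(index): item}, name, sep))
        else:
            flat[name] = value
    return flat
```

For instance, `flatten({"bgp": {"type": "UPDATE", "as_path": [65001, 65002]}})` produces the keys `bgp_type` and `bgp_as_path`, each of which can map to its own column.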
After the logs arrive in Azure Monitor, they can be analyzed with Kusto queries to achieve our initial objectives. The workbook containing the charts in this document can be obtained from the GitHub repository here.
Azure Monitor Workbook Example
As examples of some queries that would be helpful, I started with a brief histogram analysis of the BGP message types sent over time. As the following picture shows, there is a relatively constant level of BGP keepalive messages over time, and changes to that constant pattern can be easily identified in the timeline.
For example, the first bump in the chart below (at around 8:50pm) corresponds to attaching an ExpressRoute circuit to the VNet. The second and larger one (at around 10:30pm) is due to a test simulating an outage, during which the BGP adjacencies between Route Server and NVA were torn down for some time and then reestablished.
The second set of queries gives an idea of the number of prefixes and AS paths existing in the network. This is important, for example, to identify unwanted routes coming from unexpected AS paths. One use case where this could be relevant is when a BGP misconfiguration on premises sends prefixes learnt over the ExpressRoute Microsoft peering towards the private peering. This kind of visual would help to identify that problem very quickly.
In the example below I only have 4 routes in the system, but in a production scenario you can have hundreds of them.
And lastly, different problematic scenarios can be identified through careful analysis of the BGP logs. In this example the BGP messages are searched to find overlapping routes, which might happen if different Virtual Networks with overlapping IP prefixes are connected to the same ExpressRoute circuit (yes, Azure will allow you to do that, and yes, I have seen this happening in a real environment):
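The same check is easy to prototype outside of Azure Monitor. The following sketch uses Python's standard `ipaddress` module to list pairs of distinct prefixes that overlap. It is only an illustration of the idea, not a translation of the workbook query, and it assumes you already have the announced prefixes as a list of strings. Exact duplicates are collapsed, so the same prefix announced from two places would need an additional check on the announcing source.

```python
import ipaddress
from itertools import combinations


def find_overlaps(prefixes):
    """Return pairs of distinct prefixes where one overlaps the other."""
    # Parse and drop exact duplicates while keeping the original order
    networks = list(dict.fromkeys(ipaddress.ip_network(p) for p in prefixes))
    overlaps = []
    for first, second in combinations(networks, 2):
        # IPv4 and IPv6 networks cannot be compared with each other
        if first.version == second.version and first.overlaps(second):
            overlaps.append((str(first), str(second)))
    return overlaps
```

With the input `["10.1.0.0/16", "10.1.1.0/24", "10.2.0.0/16"]` only the first two prefixes are reported, since 10.1.1.0/24 sits inside 10.1.0.0/16. The pairwise comparison is quadratic, which is fine for a few hundred routes but would need a smarter structure for a full Internet table.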
Additional queries might focus on the route table stability to identify route flapping, or other aspects that can be important for the correct operation of the network.
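As one possible way of approaching the flapping case, the sketch below counts how often a prefix changes state between announced and withdrawn inside a sliding time window. The input format (tuples of timestamp, prefix and action) and the thresholds are assumptions chosen for illustration, not values from the setup described here.

```python
from collections import defaultdict


def find_flapping_prefixes(events, window_seconds=300, min_changes=4):
    """Return {prefix: max state changes seen within any window} for flapping prefixes.

    events: iterable of (timestamp, prefix, action) tuples, where action is
    "announce" or "withdraw" and timestamp is in seconds.
    """
    change_times = defaultdict(list)
    last_action = {}
    for timestamp, prefix, action in sorted(events):
        # Only count real state changes, not repeated announcements
        if last_action.get(prefix) != action:
            change_times[prefix].append(timestamp)
            last_action[prefix] = action

    flapping = {}
    for prefix, stamps in change_times.items():
        start = 0
        busiest = 0
        for end in range(len(stamps)):
            # Shrink the window until it spans at most window_seconds
            while stamps[end] - stamps[start] > window_seconds:
                start += 1
            busiest = max(busiest, end - start + 1)
        if busiest >= min_changes:
            flapping[prefix] = busiest
    return flapping
```

A prefix that is announced and withdrawn twice within five minutes would be flagged with the default values, while one that changes state once an hour would not.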
Hopefully this will give you an idea of what is possible by analyzing your BGP routes. Whether you use your own setup like here, or a commercial or open source product, there are many insights to be gained by having a deep look into your BGP cogwheels.