Monitoring Azure Networks with Alerts

Monitoring is one of those underrated disciplines: everybody tells you to do it, but nobody tells you exactly how. As a consequence, there are many different approaches and few concrete recommendations.

Before continuing, a word of caution: I am not going to cover introductory topics in this post. If you are not familiar with Virtual WAN, make sure you read the docs or watch the videos in https://aka.ms/vwanvideos. Especially relevant to this topic is the video on Virtual WAN monitoring and metrics by my colleague Nirmal.

I have been looking into different ways of configuring Azure Monitor alerts for VPN and ExpressRoute connections to Azure on Virtual WAN, and I can share my personal lessons learnt:

  • Always use Connection Monitor alerts as the ultimate proof of connectivity.
  • You don’t want to rely on dynamic thresholds for certain things such as routing alerts (like the number of BGP neighbors).
  • Metrics-based alerts are very comfortable to use, but logs-based alerts might give you better reaction times.
  • Expect a minimum reaction time of around 2 minutes in some scenarios and around 5 minutes in others (the typical alert latency for Azure Connection Monitor).

In the rest of the post I will go over some failure scenarios, what alerts I had configured and how, and which ones fired (or not) and when. Spoiler alert: there might be some surprises!

You want to look at this repo

If you didn’t know this, now you have no excuse: you can deploy Azure Monitor alerts automatically. This way you will never forget to configure alerts for the resources in your production subscriptions. For example, https://github.com/Azure/alz-monitor shows you how to do this, and includes an awesome list of recommended alerts grouped by categories.

I deployed the connectivity policies in my subscription, which creates this Azure Policy assignment:

I created a Virtual WAN in my subscription with VPN and ExpressRoute gateways and, oh wonder, after a few minutes my resource group was full of Azure Monitor goodness, a sample of which you can see here:

The individual alerts in the previous snapshot are not important; the cool thing is that they were created automagically.

Connection Monitor: the ultimate question

Those alerts will tell you when something is suspicious or wrong, but not necessarily if connectivity for end users has been impaired. For that, you want to configure Azure Connection Monitor tests, with their corresponding alerts. In my case, I configured continuous ICMP and TCP pings from Azure towards my on-premises VPN device:

You should configure a different test group for each network path you want to verify, and don’t forget to define the alerts for your groups.

Some additional alerts

In addition to the metrics-based alerts provided by the ALZ-Monitor project, I configured a couple more alerts of my own. For example, for Virtual WAN ExpressRoute gateways I configured these three metrics-based alerts:

For the VPN gateway, I configured two logs-based alerts, to look for BGP or IPsec disconnect notifications:

The KQL query for the BGP disconnect logs is this:

AzureDiagnostics
| where TimeGenerated > ago(5m)
| where Category == "RouteDiagnosticLog"
| where OperationName == "BgpDisconnectedEvent"

And for completeness, the KQL query for IPsec disconnect messages:

AzureDiagnostics
| where TimeGenerated > ago(5m)
| where Category == "TunnelDiagnosticLog"
| where OperationName == "TunnelDisconnected"
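
Both queries share the same shape, so if you deploy alerts as code you may want to generate them from a small helper. The following Python is only a minimal sketch, not part of the ALZ-Monitor project: the function name `build_disconnect_query` and its parameters are hypothetical, and only the table, category and operation names come from the two queries above.

```python
def build_disconnect_query(category: str, operation: str, lookback_minutes: int = 5) -> str:
    """Build a KQL query with the same shape as the two queries in this post.

    Hypothetical helper: only the table, category and operation names
    come from the post; everything else is illustrative.
    """
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be positive")
    return "\n".join([
        "AzureDiagnostics",
        f"| where TimeGenerated > ago({lookback_minutes}m)",
        f'| where Category == "{category}"',
        f'| where OperationName == "{operation}"',
    ])


# The two log-based alert queries used for the VPN gateway.
BGP_QUERY = build_disconnect_query("RouteDiagnosticLog", "BgpDisconnectedEvent")
IPSEC_QUERY = build_disconnect_query("TunnelDiagnosticLog", "TunnelDisconnected")

if __name__ == "__main__":
    print(BGP_QUERY)
    print()
    print(IPSEC_QUERY)
```

Keeping the lookback window as a parameter makes it easier to keep it aligned with the evaluation frequency of the alert rule.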

And for my ExpressRoute circuit, two alerts:

  • The first alert is based on the logs that tell you the number of routes. However, it is only effective while you are getting close to the limit: once the number of routes exceeds 1,000, the BGP adjacency goes down, and the number of routes drops again. Hence, one lesson learnt: you also want to monitor the number of routes before they reach the ExpressRoute circuit, that is, in Virtual WAN.
  • The second one monitors the bandwidth for the two lines of the ExpressRoute circuit.

Before you ask, here is the KQL query for the circuit route table size:

AzureDiagnostics
| where Category == 'PeeringRouteLog'
| where path_s contains '65515'
| distinct network_s
| summarize count()
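
To make the "close to the limit" idea concrete, here is a small Python sketch that mirrors the query (count the distinct prefixes) and adds a headroom check. The 1,000-route limit is the one mentioned in this post; the 80% warning ratio, the sample prefixes and the function name `route_headroom` are assumptions made purely for illustration.

```python
ROUTE_LIMIT = 1000   # limit mentioned in this post
WARN_RATIO = 0.8     # hypothetical warning level, pick your own


def route_headroom(prefixes, limit=ROUTE_LIMIT, warn_ratio=WARN_RATIO):
    """Count distinct prefixes (like `distinct network_s | summarize count()`)
    and report how much room is left before the limit."""
    count = len(set(prefixes))
    return {
        "count": count,
        "remaining": limit - count,
        "warn": count >= limit * warn_ratio,
    }


if __name__ == "__main__":
    # Made-up prefixes, each listed twice to show that duplicates are ignored.
    sample = [f"10.0.{i}.0/24" for i in range(200)] * 2
    print(route_headroom(sample))
```

Alerting on a warning level below the limit matters here because, as described above, once the limit is exceeded the adjacency drops and the route count goes down again.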

Unfortunately, at this point in time you cannot configure alerts (either metrics-based or logs-based) for Virtual WAN hubs, but the ones above will give us good visibility.

Dynamic Thresholds are ‘meh’ for routing

For the VPN gateway, one of the alerts created by the ALZ-Monitor project measures the number of healthy BGP peers. However, the threshold is dynamic, and in a VPN outage test the alert didn’t fire (even after I set the sensitivity to High and excluded the initial time from the calculation). The reason is that the dynamically calculated threshold wasn’t aggressive enough:

The previous chart shows that the minimum threshold calculated is 4, but I want to fire as soon as one neighbor goes down. Hence, I changed that to a static threshold:

The downside of static thresholds is that you will have to change your alert definition as you change your environment (in this case, as you add or remove VPN connections), so please be careful with these.
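
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Azure Monitor evaluates rules internally: the lower bound of 4 is the value from the chart discussed above, while the expected peer count of 6 and the function name `should_fire` are hypothetical.

```python
def should_fire(healthy_peers: int, threshold: int) -> bool:
    """Fire when the number of healthy BGP peers drops below the threshold."""
    return healthy_peers < threshold


DYNAMIC_LOWER_BOUND = 4   # minimum threshold calculated in the chart above
EXPECTED_PEERS = 6        # hypothetical: peers you expect when all is healthy

if __name__ == "__main__":
    observed = EXPECTED_PEERS - 1  # exactly one neighbor went down
    # The dynamic bound stays silent; a static threshold equal to the
    # expected peer count fires immediately.
    print("dynamic fires:", should_fire(observed, DYNAMIC_LOWER_BOUND))
    print("static fires: ", should_fire(observed, EXPECTED_PEERS))
```

In this sketch `EXPECTED_PEERS` is the value you would have to revisit every time you add or remove VPN connections.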

Reaction times

When shutting down the IPsec tunnels from on-premises, Azure Monitor started looking like a firework:

  • The logs-based alert for IPsec disconnection fired at around 2 minutes from the problem occurrence
  • The logs-based alert for BGP disconnection fired at around 3 minutes from the problem occurrence
  • The Connection Monitor alert fired at around 5 minutes
  • The BGP neighbor count and the bandwidth alerts fired later than that. My guess is that this is because the minimum lookback period of those alerts is 5 minutes:

By the way, to get the logs-based alerts to fire that quickly you need to look into the advanced options of the alert rule: for some alerts the default is to fire only after 4 consecutive violations, and you typically don’t want to wait that long.
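
To see why the number of required violations matters, consider this minimal Python sketch. It is a simplification and not Azure Monitor's actual evaluation logic: the function name `first_fire_index` and the sample evaluation results are hypothetical.

```python
from typing import List, Optional


def first_fire_index(violations: List[bool], required_consecutive: int) -> Optional[int]:
    """Return the index of the evaluation at which the alert would fire,
    or None if the required streak of violations is never reached."""
    streak = 0
    for index, violated in enumerate(violations):
        streak = streak + 1 if violated else 0
        if streak >= required_consecutive:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical results of consecutive evaluations: the problem
    # starts at the second evaluation and persists.
    evaluations = [False, True, True, True, True, True]
    print("1 violation required: ", first_fire_index(evaluations, 1))
    print("4 violations required:", first_fire_index(evaluations, 4))
```

With these sample values the rule requiring 4 consecutive violations fires three evaluations later than the rule requiring one, so the extra delay is three times whatever evaluation frequency you configured.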

For a short timeline analysis: as mentioned earlier, the first alert to fire was the logs-based alert for IPsec disconnection events, at 9:17 (around 2 minutes after the actual disconnection):

The Connection Monitor alert did a nice job of telling us that connectivity was impaired (albeit around 3 minutes after the first logs-based alert), and Azure Monitor shows you the evolution of the failed network connectivity attempts that fired the alert, as well as when connectivity was restored (which automatically clears the alert in Azure Monitor):

The bandwidth per tunnel metric is a good way to monitor tunnels, but it was a bit slower than Connection Monitor (if you look at the time axis), due to the 5-minute aggregation time:

That evaluation time is also the reason why the BGP alert took a bit longer to fire. Both the tunnel bandwidth and the BGP peer status have fewer data points than the Connection Monitor alerts because they aggregate metrics over 5 minutes.
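
The effect of the aggregation window can be illustrated with a short Python sketch. All values here are made up: per-minute bandwidth samples, an alert condition of "average below 1", and the helper name `window_averages` are assumptions chosen only to show the mechanism, not measurements from my tests.

```python
def window_averages(samples, window=5):
    """Average per-minute samples over consecutive, non-overlapping windows."""
    return [
        sum(samples[start:start + window]) / window
        for start in range(0, len(samples) - window + 1, window)
    ]


if __name__ == "__main__":
    # Hypothetical per-minute tunnel bandwidth: traffic stops in minute 4.
    per_minute = [100, 100, 100, 0, 0, 0, 0, 0, 0, 0]
    threshold = 1

    averages = window_averages(per_minute)
    first_bad_minute = next(i for i, v in enumerate(per_minute) if v < threshold) + 1
    first_bad_window = next(i for i, v in enumerate(averages) if v < threshold) + 1

    print("5-minute averages:", averages)
    print("first bad 1-minute sample: minute", first_bad_minute)
    print("first bad 5-minute window ends at minute", first_bad_window * 5)
```

The first window still contains healthy samples, so its average stays above the threshold and the condition is only met once a whole window of bad samples has been collected.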

ExpressRoute

What would happen if you don’t have a Connection Monitor alert configured? In some cases you might lose substantial visibility, and troubleshooting might become more complex. For example, I simulated a situation where I injected more than 1,000 routes into ExpressRoute, which exceeds the limits of the circuit.

I introduced the problem twice. Let’s focus on the second occurrence. My alert for detecting more than 1,000 learned routes fired at 10:54, around 4 minutes 30 seconds after I introduced the excess routes. The response time is of the same order of magnitude as that of the Connection Monitor alerts we saw in the previous section.

The second alert I have, which looks at the advertised routes, fired 5 minutes later, at 10:59:

Interestingly enough, the ExpressRoute alerts on the circuit didn’t fire. For example, the BGP availability stayed at 100%, since this metric only looks at the BGP neighborships with the customer routers, not with the ExpressRoute gateways:

Layer-2 problem

Another problem I simulated was changing the VLAN ID in my on-premises router to an incorrect value. Note that Megaport states that 2 minutes might pass between the time you send the API call and the actual configuration change, so keep that in mind when reading the timings. After changing the VLAN, I used Azure CLI to watch for the BGP adjacency going down, which happened around 3 minutes 30 seconds after pushing the button:

❯ az network express-route list-route-tables-summary -g $rg -n Washington --path primary --peering-name AzurePrivatePeering --query value -o table
As      Neighbor        StatePfxRcd    UpDown    V
------  --------------  -------------  --------  ---
133937  169.254.75.201  Active         00:00:31  4
65515   192.168.0.12    44             00:17:06  4
65515   192.168.0.13    44             00:17:04  4
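
If you want to script this kind of check, the table output above is easy to parse. The following Python sketch works on the exact text shown in this post and relies on the usual BGP summary convention that an established session shows a prefix count, while anything else shows a state name such as Active. The function name `down_neighbors` is hypothetical, and a more robust script would probably parse JSON output instead of a text table.

```python
SAMPLE_OUTPUT = """\
As      Neighbor        StatePfxRcd    UpDown    V
------  --------------  -------------  --------  ---
133937  169.254.75.201  Active         00:00:31  4
65515   192.168.0.12    44             00:17:06  4
65515   192.168.0.13    44             00:17:04  4
"""


def down_neighbors(table_text):
    """Return the neighbors whose StatePfxRcd column is not a prefix count,
    which means the BGP session is not established."""
    # Skip the header row and the separator row.
    rows = [line.split() for line in table_text.strip().splitlines()[2:]]
    return [
        {"as": row[0], "neighbor": row[1], "state": row[2]}
        for row in rows
        if len(row) >= 3 and not row[2].isdigit()
    ]


if __name__ == "__main__":
    for peer in down_neighbors(SAMPLE_OUTPUT):
        print(f"AS {peer['as']} neighbor {peer['neighbor']} is {peer['state']}")
```

Here the only neighbor reported is the on-premises one (AS 133937), while the two sessions with AS 65515 remain established.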

Around six minutes after the adjacency went down, the first alert fired: the ExpressRoute circuit BGP availability.

It took the bandwidth-based alert another 3 minutes, but eventually it fired too:

Unfortunately, the Virtual WAN ExpressRoute gateway doesn’t generate logs, so I couldn’t configure logs-based alerts as in the VPN test before.

Conclusion

Even if today you can only configure both logs-based and metrics-based alerts on the Virtual WAN VPN gateway (the Virtual WAN ExpressRoute gateway supports only metrics-based alerts, and the virtual hub supports none), you can still get good visibility of what is happening.

A crucial element for that is Connection Monitor alerts, which will give you the ultimate proof of whether traffic is flowing through your network or not.
