ExpressRoute traffic visibility: Flow Logs or Traffic Collector?

You might have heard about VNet Flow Logs; I posted about this new Azure feature here. One application of VNet Flow Logs is gaining visibility into traffic in places that used to be blind spots, such as the gateway subnet, where you can inspect VPN or ExpressRoute traffic.

Speaking of ExpressRoute, there is another feature that gives you traffic visibility: ExpressRoute Traffic Collector. Until recently it was only available for ExpressRoute Direct circuits, but now that it also works on provider-managed ExpressRoute circuits with bandwidths of 1 Gbps or higher (see this FAQ), I have finally been able to test it.

Before we dive in, a word of caution: VNet Flow Logs are very versatile, and in this blog post I will primarily focus on using them in the gateway subnet of an Azure Virtual Network for ExpressRoute. Other usages of VNet Flow Logs, such as using them in the gateway subnet for VPN gateway traffic, in the firewall subnet or anywhere else are out of scope for this post.

The first main difference between the two is where the logs are taken: while VNet Flow Logs can capture traffic at any given subnet, specifically in the GatewaySubnet in this example, ExpressRoute Traffic Collector will collect the logs in the external interfaces of the Microsoft edge routers for ExpressRoute (often referred to as MSEE or Microsoft Enterprise Edge). Consequently, the traffic you will capture with each approach will be different:

A second difference is the fact that Traffic Collector will sample data with a 1-to-4,096 ratio (only one out of 4,096 packets will be logged). While this is great from a cost perspective, you need to take this into account when drawing your conclusions.
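To put numbers on that caveat, here is a minimal Python sketch. It assumes that every packet is sampled independently with probability 1/4,096, which is a simplification and not a documented property of Traffic Collector; the function names and the example values are made up for illustration.

```python
SAMPLING_RATIO = 4096  # one packet out of 4,096 is logged, as described above


def estimate_actual_bytes(sampled_bytes: int, ratio: int = SAMPLING_RATIO) -> int:
    """Scale a sampled byte count up to a rough estimate of the real volume."""
    return sampled_bytes * ratio


def chance_of_being_logged(packets: int, ratio: int = SAMPLING_RATIO) -> float:
    """Probability that at least one packet of a flow ends up in the logs.

    Assumption: each packet is picked independently with probability 1/ratio.
    """
    return 1.0 - (1.0 - 1.0 / ratio) ** packets


# Hypothetical values: 1,500,000 sampled bytes, and flows of 100 and 10,000 packets
print(estimate_actual_bytes(1_500_000))
print(round(chance_of_being_logged(100), 3))
print(round(chance_of_being_logged(10_000), 3))
```

Under that assumption a 100-packet flow has only about a 2% chance of showing up at all, while a 10,000-packet flow shows up roughly 91% of the time. Large flows are therefore well represented, and small ones mostly are not.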

Finally, the mechanisms used to record flows are very different in the two cases, so the fields stored for each flow are different too. But more on this later.

Topology

Let’s start with the topology that I have used to test this. I have a traditional hub-and-spoke environment, with two spokes and an ExpressRoute gateway:

In the ExpressRoute circuit I have a private peering configured:

I am simulating some traffic flows:

  • From Azure to on-premises.
  • From on-premises to Azure.
  • From spoke to spoke (this traffic is hairpinned to the MSEE).

Trending

One of the most frequent use cases for traffic visibility is evaluating the amount of traffic going through ExpressRoute. It is very easy to build this kind of view with Traffic Collector, separating inbound from outbound traffic (note that the query below multiplies the sampled bytes by 4,096 to approximate the real values):

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(2h)
| extend Direction = iff(DstAsn == 65515, "Inbound", "Outbound")
| summarize Bytes=4096*sum(NumberOfBytes) by bin(TimeGenerated, 5m), Direction
| render timechart

Note in the query above how convenient the DstAsn field is to separate traffic going to Azure (always with the Autonomous System Number 65515) from traffic going to on-premises.
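If you export the records and want to reproduce the same logic outside of Log Analytics, the classification and binning can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the sample records and the on-premises ASN 65001 are invented, and only the Azure ASN 65515 and the 4,096 ratio come from the text.

```python
from collections import defaultdict
from datetime import datetime

AZURE_ASN = 65515       # ASN that Azure always uses, per the text above
SAMPLING_RATIO = 4096   # one packet in 4,096 is logged


def direction(dst_asn: int) -> str:
    """Traffic whose destination ASN is Azure's is inbound, the rest is outbound."""
    return "Inbound" if dst_asn == AZURE_ASN else "Outbound"


def bin_start(ts: datetime, minutes: int = 5) -> datetime:
    """Floor a timestamp to the start of its N-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def bytes_per_bin(records):
    """Sum estimated bytes per (bin, direction) from (timestamp, dst_asn, sampled_bytes) tuples."""
    totals = defaultdict(int)
    for ts, dst_asn, sampled_bytes in records:
        totals[(bin_start(ts), direction(dst_asn))] += sampled_bytes * SAMPLING_RATIO
    return dict(totals)


# Invented sample records; 65001 stands in for a hypothetical on-premises ASN
records = [
    (datetime(2024, 1, 1, 10, 1, 30), 65515, 1000),
    (datetime(2024, 1, 1, 10, 3, 0), 65515, 500),
    (datetime(2024, 1, 1, 10, 4, 0), 65001, 200),
    (datetime(2024, 1, 1, 10, 7, 0), 65001, 300),
]
totals = bytes_per_bin(records)
for key in sorted(totals):
    print(key, totals[key])
```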

With VNet Flow Logs this kind of statistic is more complicated. Firstly, only traffic coming from on-premises is logged, since Azure-to-on-premises packets don’t traverse the ExpressRoute Gateway.

Additionally, the way in which the information is logged in VNet Flow Logs is different: while Traffic Collector is “stateless” (it will record how much info goes from A to B), VNet Flow Logs are stateful: they record who started the connection, and store how much traffic goes in which direction in the fields BytesSrcToDest and BytesDestToSrc. Hence, for all bytes going from A to B you need to add BytesSrcToDest for flows from A to B, and BytesDestToSrc for flows from B to A:

let outboundforward = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '192.168.0.0/16') and ipv4_is_in_range(DestIp, '10.0.0.0/8')
| summarize TotalBytes=sum(BytesSrcToDest)
| extend FlowDirection='Outbound', TransferDirection='Outbound');
let outboundreturn = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '192.168.0.0/16') and ipv4_is_in_range(DestIp, '10.0.0.0/8')
| summarize TotalBytes=sum(BytesDestToSrc)
| extend FlowDirection='Outbound', TransferDirection='Inbound');
let inboundforward = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '10.0.0.0/8') and ipv4_is_in_range(DestIp, '192.168.0.0/16')
| summarize TotalBytes=sum(BytesSrcToDest)
| extend FlowDirection='Inbound', TransferDirection='Inbound');
let inboundreturn = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '10.0.0.0/8') and ipv4_is_in_range(DestIp, '192.168.0.0/16')
| summarize TotalBytes=sum(BytesDestToSrc)
| extend FlowDirection='Inbound', TransferDirection='Outbound');
union outboundforward, outboundreturn, inboundforward, inboundreturn

The reason why all outbound flows in the previous query are zero is that traffic from Azure to on-premises bypasses the ExpressRoute gateway (except return traffic from private endpoints), following this pattern:

Long story short: for trending on your ExpressRoute statistics, Traffic Collector will be more efficient than VNet Flow Logs.
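The byte accounting for stateful flow records described above is easier to see with a couple of records at hand. The following Python sketch uses invented sample values and hypothetical host addresses; only the field names and the two address ranges are taken from the queries in this post.

```python
from ipaddress import ip_address, ip_network

AZURE = ip_network("192.168.0.0/16")   # Azure range used in the queries above
ONPREM = ip_network("10.0.0.0/8")      # on-premises range used in the queries above


def bytes_from_to(flows, src_net, dst_net):
    """Total bytes travelling from src_net to dst_net, whoever opened the connection."""
    total = 0
    for flow in flows:
        src, dst = ip_address(flow["SrcIp"]), ip_address(flow["DestIp"])
        if src in src_net and dst in dst_net:
            # Connection opened on the src_net side: forward bytes go src_net -> dst_net
            total += flow["BytesSrcToDest"]
        elif src in dst_net and dst in src_net:
            # Connection opened on the dst_net side: return bytes go src_net -> dst_net
            total += flow["BytesDestToSrc"]
    return total


# Invented flow records, for illustration only
flows = [
    {"SrcIp": "10.1.1.4", "DestIp": "192.168.1.4", "BytesSrcToDest": 1000, "BytesDestToSrc": 400},
    {"SrcIp": "192.168.1.4", "DestIp": "10.1.1.4", "BytesSrcToDest": 50, "BytesDestToSrc": 700},
]
print(bytes_from_to(flows, ONPREM, AZURE))  # 1000 forward + 700 return
print(bytes_from_to(flows, AZURE, ONPREM))  # 50 forward + 400 return
```

With flow logs taken in the GatewaySubnet, the Azure-to-on-premises counters would mostly be zero for the reason explained above, so the second total is the one you cannot trust there.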

App Discovery

The fact that Traffic Collector flows are stateless means that sometimes the source and destination ports are not correctly identified. For example, in the next output the third row shows port 80 as source, where it is actually the destination. Traffic Collector just logs packets as it sees them, but it has no concept of who started the TCP connection:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(12h)
| summarize TotalBytes=sum(NumberOfBytes) by SourceIp, DestinationIp, Protocol, SourcePort, DestinationPort
| sort by TotalBytes desc
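One possible workaround is to guess the server side from the port numbers. The Python sketch below is a heuristic and not something Traffic Collector does: it assumes that the service listens on a well-known port (0 to 1023) and the client uses a higher one, and it gives up when that assumption does not help. The function name and the example ports are hypothetical.

```python
WELL_KNOWN_MAX = 1023  # upper end of the well-known port range


def guess_service_port(source_port: int, destination_port: int):
    """Guess which port belongs to the server side of a stateless flow record.

    Heuristic: if exactly one of the two ports is well-known, assume it is the
    service port. Returns None when the record is ambiguous.
    """
    src_known = source_port <= WELL_KNOWN_MAX
    dst_known = destination_port <= WELL_KNOWN_MAX
    if src_known != dst_known:
        return source_port if src_known else destination_port
    return None


print(guess_service_port(80, 51514))     # return traffic of an HTTP session
print(guess_service_port(51514, 443))    # client-to-server HTTPS packet
print(guess_service_port(50000, 50001))  # ambiguous, no guess
```

Applications listening on high ports will not be identified this way, so treat the result as a hint rather than as a fact.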

You might think that VNet Flow Logs would be better for this. Normally yes, but the GatewaySubnet is an exception. The reason is again that half of the packets are missing because traffic from Azure to on-premises bypasses the ExpressRoute gateway, so VNet Flow Logs cannot identify the correct direction:

NTANetAnalytics
| where TimeGenerated > ago(2h)
| where ipv4_is_private(SrcIp) and ipv4_is_private(DestIp)
| summarize TotalBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcIp,DestIp, L4Protocol, SrcPort, DestPort
| sort by TotalBytes desc

The consequence: neither Traffic Collector nor VNet Flow Logs in the Gateway Subnet is great for app discovery. You probably want to enable VNet Flow Logs in some other subnet, such as the AzureFirewallSubnet or your NVA’s subnet.

MSEE-Hairpinned traffic

Under some circumstances, traffic between spokes can end up being routed by the ExpressRoute edge router. This is typically not a good idea because it introduces additional latency and performance limitations. However, some networks do this inadvertently when advertising wide summary routes from on-premises:

Here, though, the ExpressRoute gateway will see both directions of each flow, because for each direction (spoke1-to-spoke2 and spoke2-to-spoke1) at least one leg will hit the GatewaySubnet:

NTANetAnalytics
| where TimeGenerated > ago(1h)
| where ipv4_is_private(SrcIp) and ipv4_is_private(DestIp)
| where (ipv4_is_in_range(SrcIp, '192.168.0.0/16') and ipv4_is_in_range(DestIp, '192.168.0.0/16')) and not (ipv4_is_in_range(SrcIp, '192.168.64.0/27') and ipv4_is_in_range(DestIp, '192.168.64.0/27'))
| summarize TotalBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcIp,DestIp, L4Protocol, L7Protocol,DestPort
| where TotalBytes > 0
| sort by TotalBytes desc
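The address filter in that query is the interesting part: both endpoints must be in the Azure range, but not both in the gateway subnet (which would be gateway-to-gateway traffic). Here is the same condition as a small Python sketch using the standard ipaddress module; the two ranges come from the query, while the host addresses in the examples are hypothetical.

```python
from ipaddress import ip_address, ip_network

AZURE = ip_network("192.168.0.0/16")            # Azure range from the query above
GATEWAY_SUBNET = ip_network("192.168.64.0/27")  # gateway subnet from the query above


def is_hairpinned(src_ip: str, dst_ip: str) -> bool:
    """True for Azure-to-Azure flows that are not gateway-to-gateway traffic."""
    src, dst = ip_address(src_ip), ip_address(dst_ip)
    both_azure = src in AZURE and dst in AZURE
    both_gateway = src in GATEWAY_SUBNET and dst in GATEWAY_SUBNET
    return both_azure and not both_gateway


print(is_hairpinned("192.168.1.4", "192.168.2.4"))    # spoke to spoke
print(is_hairpinned("192.168.64.4", "192.168.64.5"))  # gateway instance to gateway instance
print(is_hairpinned("10.1.1.4", "192.168.1.4"))       # on-premises to Azure
```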

However, you can’t see spoke-to-spoke traffic in Traffic Collector, because these packets don’t hit the customer-facing router interfaces where the flow collection is happening:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(12h)
| where ipv4_is_in_range(SourceIp, '192.168.0.0/16') and ipv4_is_in_range(DestinationIp, '192.168.0.0/16')
| summarize TotalBytes=sum(NumberOfBytes) by SourceIp, DestinationIp, Protocol, SourcePort, DestinationPort
| sort by TotalBytes desc

Control Plane

You might want to capture control traffic such as BGP. However, due to the sampled nature of Traffic Collector, this can be tricky in high-bandwidth circuits. For example, in the figure below I can’t see any BGP packet. I do see ICMP traffic though, which is a significant difference as compared to VNet Flow Logs:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(1d)
| where ipv4_is_in_range(SourceIp, '169.254.197.24/29') and ipv4_is_in_range(DestinationIp, '169.254.197.24/29')
| summarize TotalBytes=sum(NumberOfBytes) by SourceIp, DestinationIp, Protocol, SourcePort, DestinationPort, IcmpType
| where TotalBytes > 0
| sort by TotalBytes desc
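When going through exported records by hand, it helps to label the control-plane traffic explicitly. The Python sketch below assumes that the protocol is available as an IANA protocol number (6 for TCP, 1 for ICMP), which may not match how your export represents it. It checks both ports for 179, the well-known BGP port, because a stateless record does not tell you which side is the server.

```python
TCP, ICMP = 6, 1   # IANA protocol numbers (assumed representation of the protocol field)
BGP_PORT = 179     # well-known TCP port for BGP


def classify(protocol: int, source_port: int, destination_port: int) -> str:
    """Label a flow record as BGP, ICMP or other."""
    if protocol == TCP and BGP_PORT in (source_port, destination_port):
        return "BGP"
    if protocol == ICMP:
        return "ICMP"
    return "other"


print(classify(6, 179, 52344))   # BGP packet sent by the listening side
print(classify(6, 52344, 179))   # BGP packet sent by the connecting side
print(classify(1, 0, 0))         # ICMP
print(classify(6, 52344, 443))   # other
```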

With VNet Flow Logs the BGP connectivity between the gateways is visible straight away:

NTANetAnalytics
| where TimeGenerated > ago(1d)
| where ipv4_is_in_range(SrcIp, '192.168.64.0/27') and ipv4_is_in_range(DestIp, '192.168.64.0/27')
| summarize TotalBytes=sum(BytesSrcToDest+BytesDestToSrc) by SrcIp, DestIp, L4Protocol, SrcPort, DestPort
| where TotalBytes > 0
| sort by TotalBytes desc

Circuit load balancing

If you want to inspect the load on each of the two links that make up an ExpressRoute circuit, you can use the NextHop field of the Traffic Collector logs. However, this only works in one direction: from Azure to on-premises. The opposite direction is not really useful, since the next hop is in Microsoft’s backbone and is represented as 0.0.0.0.

Here is a representation of both directions, where only Azure-to-on-premises traffic is split over both links:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(3h)
| summarize TotalBytes=sum(NumberOfBytes) by NextHop, bin(TimeGenerated, 10m)
| render timechart
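To turn those per-next-hop totals into a load-balancing ratio, you can drop the 0.0.0.0 records and compute each link's share of the remaining bytes. A minimal Python sketch, with hypothetical next-hop addresses and byte counts:

```python
from collections import defaultdict

BACKBONE_NEXT_HOP = "0.0.0.0"  # next hop shown for on-premises-to-Azure traffic


def next_hop_share(records):
    """Share of Azure-to-on-premises bytes per next hop, from (next_hop, bytes) tuples."""
    totals = defaultdict(int)
    for next_hop, sampled_bytes in records:
        if next_hop == BACKBONE_NEXT_HOP:
            continue  # opposite direction, tells nothing about the two links
        totals[next_hop] += sampled_bytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {hop: value / grand_total for hop, value in totals.items()}


# Invented records; the two next hops stand in for the on-premises side of each link
records = [
    ("169.254.197.25", 600),
    ("169.254.197.29", 400),
    ("0.0.0.0", 5000),
]
print(next_hop_share(records))
```

Since both links are sampled at the same ratio, there is no need to multiply by 4,096 to compare them.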

Conclusion

There are other fields in both VNet Flow Logs and Traffic Collector that will give you interesting information. For example, Traffic Collector has the fields IpClassOfService, Dot1qVlanId and Dot1qCustomerVlanId, if you are curious about this kind of thing.

All in all, depending on your goals for traffic visibility, using Traffic Collector or VNet Flow Logs might be more convenient. My guess is that most folks out there are going to end up using both, for different use cases.

What are your thoughts?

4 thoughts on “ExpressRoute traffic visibility: Flow Logs or Traffic Collector?”

  1. What is the actual use case for the express route traffic collector, could you share some real world scenarios where it can bring added value in troubleshooting and how?

    I find it very difficult to correlate data from ER Traffic Collector with, for example, network traces collected on-prem or at the hops between Azure, the transit network and on-prem: the data is not real time, and matching on criteria such as source port or TCP flags is very difficult or impossible.

    I would also like to give feedback on the documentation: the API schema in particular lacks proper explanations. It is too generic and hard to understand. For example:

    Flowsequence (long): “Flow sequence of this flow.” What does this actually represent? It is not the sequence number that you would normally see in tcpdump.

    1. Hey Stefan! The main use case I have seen in the organizations I have worked with is knowing which applications are using the bandwidth (and how much) in ExpressRoute.

      This helps to size the circuits correctly, and potentially to optimize the app architecture to place its components in the optimal locations.

  2. Excellent post Jose! Thanks so much for putting this together. You covered the benefits and considerations really well which makes decisions around this stuff SO much easier.

    1. Thanks Matt, happy it helps!
