ExpressRoute traffic visibility: Flow Logs or Traffic Collector?

You might have heard about VNet Flow Logs; I posted about this new Azure feature here. One application of VNet Flow Logs is gaining visibility into traffic in places that had been blind spots until now, such as the gateway subnet, where you can inspect VPN or ExpressRoute traffic.

Talking about ExpressRoute, there is another feature that gives you traffic visibility: ExpressRoute Traffic Collector. This functionality was until recently only available for ExpressRoute Direct circuits, but since it now works on provider-managed ExpressRoute circuits as well, for bandwidths of 1 Gbps or higher (see this FAQ), I have finally been able to test it.

Before we dive in, a word of caution: VNet Flow Logs are very versatile, and in this blog post I will primarily focus on using them in the gateway subnet of an Azure Virtual Network for ExpressRoute. Other usages of VNet Flow Logs, such as using them in the gateway subnet for VPN gateway traffic, in the firewall subnet or anywhere else are out of scope for this post.

The first main difference between the two is where the logs are taken: while VNet Flow Logs can capture traffic in any given subnet (the GatewaySubnet in this example), ExpressRoute Traffic Collector collects the logs on the external interfaces of the Microsoft edge routers for ExpressRoute (often referred to as MSEE or Microsoft Enterprise Edge). Consequently, the traffic you capture with each approach will be different:

A second difference is the fact that Traffic Collector will sample data with a 1-to-4,096 ratio (only one out of 4,096 packets will be logged). While this is great from a cost perspective, you need to take this into account when drawing your conclusions.
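To get a feeling for what the sampling means in practice, here is a minimal Python sketch. The function names are made up for illustration, and the second function assumes that every packet is sampled independently with probability 1/4,096, which is a simplification I have not verified against how the routers actually sample:

```python
SAMPLING_RATIO = 4096  # one packet out of 4,096 is logged, as described above


def estimate_actual_bytes(sampled_bytes: int) -> int:
    """Scale a sampled byte count up to a rough estimate of the real volume."""
    return sampled_bytes * SAMPLING_RATIO


def probability_flow_is_missed(packets_in_flow: int) -> float:
    """Chance that none of the packets of a flow gets sampled.

    Assumption: every packet is sampled independently with probability 1/4096.
    """
    return (1 - 1 / SAMPLING_RATIO) ** packets_in_flow


print(estimate_actual_bytes(1_500))       # sampled 1,500 bytes, scaled up
print(probability_flow_is_missed(1_000))  # a 1,000-packet flow
```

Under that assumption, a flow of 1,000 packets has roughly a 78% chance of leaving no record at all, so short or low-volume flows can easily be invisible in Traffic Collector even though the scaled totals for heavy flows are reasonable.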

Finally, the mechanisms used to record flows are very different in the two cases, so the fields stored for each flow are different too. More on this later.

Topology

Let’s start with the topology that I have used to test this. I have a traditional hub-and-spoke environment, with two spokes and an ExpressRoute gateway:

In the ExpressRoute circuit I have a private peering configured:

I am simulating some traffic flows:

  • From Azure to on-premises.
  • From on-premises to Azure.
  • From spoke to spoke (this traffic is hairpinned to the MSEE).

Trending

One of the most frequent use cases for traffic visibility is evaluating the amount of traffic going through ExpressRoute. It is very easy to build this kind of view with Traffic Collector, separating inbound from outbound traffic (note that the query below multiplies the sampled byte counts by 4,096 to approximate the actual values):

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(2h)
| extend Direction = iff(DstAsn == 65515, "Inbound", "Outbound")
| summarize Bytes=4096*sum(NumberOfBytes) by bin(TimeGenerated, 5m), Direction
| render timechart

Note in the query above how the DstAsn field is very convenient to separate traffic going to Azure (always with the Autonomous System Number 65515) from traffic going to on-premises.
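If you export the Traffic Collector records and want to reproduce the same logic outside of Log Analytics, a rough Python equivalent could look like the sketch below. The record layout (a list of dictionaries with the TimeGenerated, DstAsn and NumberOfBytes fields) and the function names are assumptions for illustration only:

```python
from collections import defaultdict
from datetime import datetime

AZURE_ASN = 65515       # ASN that identifies Azure as the destination
SAMPLING_RATIO = 4096   # scale factor for the 1-to-4,096 sampling


def bin_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def summarize(records):
    """Sum estimated bytes per (5-minute bin, direction)."""
    totals = defaultdict(int)
    for r in records:
        direction = "Inbound" if r["DstAsn"] == AZURE_ASN else "Outbound"
        key = (bin_start(r["TimeGenerated"]), direction)
        totals[key] += r["NumberOfBytes"] * SAMPLING_RATIO
    return dict(totals)


# Hypothetical records, only to show the shape of the input
records = [
    {"TimeGenerated": datetime(2024, 1, 1, 10, 2, 30), "DstAsn": 65515, "NumberOfBytes": 100},
    {"TimeGenerated": datetime(2024, 1, 1, 10, 4, 59), "DstAsn": 65515, "NumberOfBytes": 50},
    {"TimeGenerated": datetime(2024, 1, 1, 10, 7, 0), "DstAsn": 64512, "NumberOfBytes": 10},
]
result = summarize(records)
```

The first two records fall in the same 5-minute bin and are both counted as inbound, while the third one ends up in the next bin as outbound.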

With VNet Flow Logs this kind of statistic is more complicated. Firstly, only traffic coming from on-premises is logged, since Azure-to-on-premises packets don’t traverse the ExpressRoute Gateway.

Additionally, the way in which the information is logged in VNet Flow Logs is different: while Traffic Collector is “stateless” (it will record how much info goes from A to B), VNet Flow Logs are stateful: they record who started the connection, and store how much traffic goes in which direction in the fields BytesSrcToDest and BytesDestToSrc. Hence, for all bytes going from A to B you need to add BytesSrcToDest for flows from A to B, and BytesDestToSrc for flows from B to A:

let outboundforward = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '192.168.0.0/16') and ipv4_is_in_range(DestIp, '10.0.0.0/8')
| summarize TotalBytes=sum(BytesSrcToDest)
| extend FlowDirection='Outbound', TransferDirection='Outbound');
let outboundreturn = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '192.168.0.0/16') and ipv4_is_in_range(DestIp, '10.0.0.0/8')
| summarize TotalBytes=sum(BytesDestToSrc)
| extend FlowDirection='Outbound', TransferDirection='Inbound');
let inboundforward = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '10.0.0.0/8') and ipv4_is_in_range(DestIp, '192.168.0.0/16')
| summarize TotalBytes=sum(BytesSrcToDest)
| extend FlowDirection='Inbound', TransferDirection='Inbound');
let inboundreturn = (NTANetAnalytics
| where ipv4_is_in_range(SrcIp, '10.0.0.0/8') and ipv4_is_in_range(DestIp, '192.168.0.0/16')
| summarize TotalBytes=sum(BytesDestToSrc)
| extend FlowDirection='Inbound', TransferDirection='Outbound');
union outboundforward, outboundreturn, inboundforward, inboundreturn

The reason why all outbound flows in the previous query are zero is that traffic from Azure to on-premises bypasses the ExpressRoute gateway (except return traffic from private endpoints), following this pattern:

Long story short: for trending on your ExpressRoute statistics, Traffic Collector will be more efficient than VNet Flow Logs.
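To make the stateful accounting of VNet Flow Logs a bit more tangible, here is a small Python sketch of the "add BytesSrcToDest for flows from A to B, and BytesDestToSrc for flows from B to A" rule. It assumes, as the queries above suggest, that 192.168.0.0/16 is the Azure range and 10.0.0.0/8 the on-premises range; the function name and the dictionary-based record layout are hypothetical:

```python
from ipaddress import ip_address, ip_network

AZURE = ip_network("192.168.0.0/16")   # assumed Azure prefix, taken from the queries
ONPREM = ip_network("10.0.0.0/8")      # assumed on-premises prefix, taken from the queries


def bytes_azure_to_onprem(flows):
    """Total bytes sent from Azure to on-premises across stateful flow records."""
    total = 0
    for f in flows:
        src, dst = ip_address(f["SrcIp"]), ip_address(f["DestIp"])
        if src in AZURE and dst in ONPREM:
            # Connection started in Azure: forward bytes go to on-premises
            total += f["BytesSrcToDest"]
        elif src in ONPREM and dst in AZURE:
            # Connection started on-premises: return bytes go to on-premises
            total += f["BytesDestToSrc"]
    return total


# Hypothetical flow records, only to show the shape of the input
flows = [
    {"SrcIp": "192.168.1.4", "DestIp": "10.1.1.1", "BytesSrcToDest": 100, "BytesDestToSrc": 900},
    {"SrcIp": "10.1.1.1", "DestIp": "192.168.1.4", "BytesSrcToDest": 50, "BytesDestToSrc": 700},
]
print(bytes_azure_to_onprem(flows))
```

Keep in mind that this only adds up whatever was logged: in the GatewaySubnet the Azure-to-on-premises bytes are mostly missing for the reasons explained above.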

App Discovery

The fact that Traffic Collector flows are stateless means that sometimes the source and destination ports are not correctly identified. For example, in the next output the third row shows port 80 as the source port, although it is actually the destination port of the TCP connection. Traffic Collector just logs packets as it sees them, but it has no concept of who started the TCP connection:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(12h)
| summarize TotalBytes=sum(NumberOfBytes) by SourceIp, DestinationIp, Protocol, SourcePort, DestinationPort
| sort by TotalBytes desc

You might think that VNet Flow Logs would be better for this. Normally yes, but the GatewaySubnet is an exception. The reason is again that half of the packets are missing because traffic from Azure to on-premises bypasses the ExpressRoute gateway, so VNet Flow Logs cannot identify the correct direction:

NTANetAnalytics
| where TimeGenerated > ago(2h)
| where ipv4_is_private(SrcIp) and ipv4_is_private(DestIp)
| summarize TotalBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcIp,DestIp, L4Protocol, SrcPort, DestPort
| sort by TotalBytes desc

The consequence: neither Traffic Collector nor VNet Flow Logs in the GatewaySubnet is great for app discovery. You probably want to enable VNet Flow Logs in some other subnet, such as the AzureFirewallSubnet or your NVA’s subnet.

MSEE-Hairpinned traffic

Under some circumstances you can have traffic between spokes being routed by the ExpressRoute edge router. This is typically not a good idea, because it introduces additional latency and performance limitations. However, some networks do this inadvertently when wide summary routes are advertised from on-premises:

Here though, the ExpressRoute gateway will see both directions of each flow, because for each communication part (spoke1-to-spoke2 and spoke2-to-spoke1), at least one leg will hit the GatewaySubnet:

NTANetAnalytics
| where TimeGenerated > ago(1h)
| where ipv4_is_private(SrcIp) and ipv4_is_private(DestIp)
| where (ipv4_is_in_range(SrcIp, '192.168.0.0/16') and ipv4_is_in_range(DestIp, '192.168.0.0/16')) and not (ipv4_is_in_range(SrcIp, '192.168.64.0/27') and ipv4_is_in_range(DestIp, '192.168.64.0/27'))
| summarize TotalBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcIp,DestIp, L4Protocol, L7Protocol,DestPort
| where TotalBytes > 0
| sort by TotalBytes desc
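The address filter in this query can be a bit hard to read, so here is the same condition as a minimal Python sketch. It assumes that 192.168.0.0/16 covers all the Azure VNets and that 192.168.64.0/27 is the GatewaySubnet (which is what the queries in this post suggest); the function name is made up for illustration:

```python
from ipaddress import ip_address, ip_network

AZURE = ip_network("192.168.0.0/16")            # assumed range for all Azure VNets
GATEWAY_SUBNET = ip_network("192.168.64.0/27")  # assumed GatewaySubnet prefix


def matches_hairpin_filter(src_ip: str, dst_ip: str) -> bool:
    """True for Azure-to-Azure flows, except traffic inside the GatewaySubnet."""
    src, dst = ip_address(src_ip), ip_address(dst_ip)
    both_in_azure = src in AZURE and dst in AZURE
    both_in_gateway = src in GATEWAY_SUBNET and dst in GATEWAY_SUBNET
    return both_in_azure and not both_in_gateway


print(matches_hairpin_filter("192.168.1.4", "192.168.2.4"))    # spoke to spoke
print(matches_hairpin_filter("192.168.64.4", "192.168.64.5"))  # gateway to gateway
```

Excluding flows where both endpoints are in the GatewaySubnet keeps the traffic between the gateway instances themselves out of the results.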

However, you can’t see spoke-to-spoke traffic in Traffic Collector, because these packets don’t hit the customer-facing router interfaces where the flow collection is happening:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(12h)
| where ipv4_is_in_range(SourceIp, '192.168.0.0/16') and ipv4_is_in_range(DestinationIp, '192.168.0.0/16')
| summarize TotalBytes=sum(NumberOfBytes) by SourceIp, DestinationIp, Protocol, SourcePort, DestinationPort
| sort by TotalBytes desc

Control Plane

You might want to capture control traffic such as BGP. However, due to the sampled nature of Traffic Collector, this can be tricky in high-bandwidth circuits. For example, in the figure below I can’t see any BGP packet. I do see ICMP traffic though, which is a significant difference as compared to VNet Flow Logs:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(1d)
| where ipv4_is_in_range(SourceIp, '169.254.197.24/29') and ipv4_is_in_range(DestinationIp, '169.254.197.24/29')
| summarize TotalBytes=sum(NumberOfBytes) by SourceIp, DestinationIp, Protocol, SourcePort, DestinationPort, IcmpType
| where TotalBytes > 0
| sort by TotalBytes desc

With VNet Flow Logs the BGP connectivity between the gateways is visible straight away:

NTANetAnalytics
| where TimeGenerated > ago(1d)
| where ipv4_is_in_range(SrcIp, '192.168.64.0/27') and ipv4_is_in_range(DestIp, '192.168.64.0/27')
| summarize TotalBytes=sum(BytesSrcToDest+BytesDestToSrc) by SrcIp, DestIp, L4Protocol, SrcPort, DestPort
| where TotalBytes > 0
| sort by TotalBytes desc

Circuit load balancing

If you want to inspect the load on each of the two lines that make up an ExpressRoute circuit, you can use the NextHop field of the Traffic Collector logs. However, this only works in one direction: from Azure to on-premises. The opposite direction is not really useful, since the next hop is in Microsoft’s backbone and is represented as 0.0.0.0:

Here is a representation of both directions, where only Azure-to-on-premises traffic is split over both lines:

ATCExpressRouteCircuitIpfix
| where TimeGenerated > ago(3h)
| summarize TotalBytes=sum(NumberOfBytes) by NextHop, bin(TimeGenerated, 10m)
| render timechart
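If you prefer a single number over a chart, the share of traffic on each line can be derived from the same fields. The following Python sketch is only an illustration: the function name, the record layout and the sample next-hop addresses are assumptions, and records with a next hop of 0.0.0.0 are skipped because they carry no information about the line used:

```python
from collections import Counter


def next_hop_shares(records):
    """Fraction of sampled bytes per next hop, ignoring the 0.0.0.0 placeholder."""
    totals = Counter()
    for r in records:
        if r["NextHop"] == "0.0.0.0":
            continue  # on-premises-to-Azure traffic: next hop is not meaningful
        totals[r["NextHop"]] += r["NumberOfBytes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {hop: byte_count / grand_total for hop, byte_count in totals.items()}


# Hypothetical records using documentation-range addresses as next hops
records = [
    {"NextHop": "192.0.2.1", "NumberOfBytes": 300},
    {"NextHop": "192.0.2.5", "NumberOfBytes": 100},
    {"NextHop": "0.0.0.0", "NumberOfBytes": 999},
]
print(next_hop_shares(records))
```

Since both lines are sampled with the same ratio, there is no need to multiply by 4,096 here: the factor cancels out when computing the shares.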

Conclusion

There are other fields in both VNet Flow Logs and Traffic Collector that will give you interesting information. For example, in Traffic Collector you have the fields IpClassOfService, Dot1qVlanId and Dot1qCustomerVlanId, if you are curious about this kind of thing.

All in all, depending on your goals for traffic visibility, either Traffic Collector or VNet Flow Logs might be the more convenient option. My guess is that most folks out there are going to end up using both, for different use cases.

What are your thoughts?

20 thoughts on “ExpressRoute traffic visibility: Flow Logs or Traffic Collector?”

  1. What is the actual use case for the express route traffic collector, could you share some real world scenarios where it can bring added value in troubleshooting and how?

    I find it very difficult to correlate any kind of data from ER traffic collector for example if I am collecting network traces on-prem or in all the hops between azure – (transit network) – onprem since the data is not real time and matching under any criteria like src port or tcp flags I find it very difficult or impossible.

    Also what I would like to give feedback on the documentation especially the API Schema is lack of proper explanation, It’s too generic and hard to understand.

    Flowsequence long Flow sequence of this flow. (what does this actually represent since it’s not the sequence number of the flow that you would normally see in tcp dump)

    1. Hey Stefan! The main use case I have seen in the organizations I have worked with is knowing which applications are using the bandwidth (and how much) in ExpressRoute.

      This helps to size the circuits correctly, and potentially to optimize the app architecture to place its components in the optimal locations.

  2. Excellent post Jose! Thanks so much for putting this together. You covered the benefits and considerations really well which makes decisions around this stuff SO much easier.

    1. Thanks Matt, happy it helps!

  3. Rahul

    Thank you for the detailed explanation thru this blog Jose, appreciate your efforts to make us on field folks gain knowledge. I wanted to check one thing from above then, if the customer is interested in creating an Alert whenever the bandwidth drops below a specific threshold can I simply use the VnetFlow logs here and the part of the query you shared and introduce a threshold comparison instead of trying to enable the ER Traffic collector?

    NTANetAnalytics
    | where ipv4_is_in_range(SrcIp, '10.0.0.0/8') and ipv4_is_in_range(DestIp, '192.168.0.0/16')
    | summarize TotalBytes=sum(BytesSrcToDest)
    | extend FlowDirection='Inbound', TransferDirection='Inbound'
    | where TotalBytes < Threshold

    1. Happy it is helpful Rahul! For an alert when traffic drops below a given threshold I would rather use the ExR circuit’s metrics.

      1. generouslyc1d4796538

        Thank you so much Jose! Just want to confirm my understanding here so basically using the BitsIn as documented here would be sufficient, is my understanding correct? Thank you for your guidance. https://learn.microsoft.com/en-us/azure/expressroute/expressroute-monitoring-metrics-alerts#circuits-metrics

      2. Yepp, that’s the one!

  4. […] detect traffic anomalies, about different ways to access NSG Flow Logs, and more recently about the main functional differences between VNet FLow Logs and ExpressRoute Traffic Collector as well as some sample queries to query VNet Flow […]

  5. generouslyc1d4796538

    Hello Jose,

    At one of my customers, they have lot some apps in Azure which are performing really bad while in WEST US3 than EASTUS. I know that their MSEE is in EASTUS only and that is obviously going to cause latency. Now unfortunately there old tech team was laid off and members left dont know much about how the traffic flows. We have requested them to enable vnet flow logs and I suspect there is some traffic flowing to on-prem (EASTUS again) from WEST US3 and I was trying to figure out how I can query for Azure2OnPrem traffic for some specific apps (I can use the IP Address space from Azure WEST US3 as source) but confused a bit on how I can identify the traffic to On-prem. I read your other blogs on Flow logs receipes and got the query below. But just not sure if this will get me details on for what query they are going from Azure2OnPrem (for e.g. DNS lookup, etc.). Any recommendation/suggestion on if I can use Vnet flow logs to check for traffic from Azure2OnPrem? thank you

    let prefix1="10.4.0.0/16";
    let prefix2="10.1.0.0/16";
    NTANetAnalytics
    | where SubType == 'FlowLog' and FaSchemaVersion == '3' and FlowStartTime > ago(24h)
    | extend SrcIpIsInPrefix1 = ipv4_is_in_range(SrcIp, prefix1), SrcIpIsInPrefix2 = ipv4_is_in_range(SrcIp, prefix2)
    | extend DestIpIsInPrefix1 = ipv4_is_in_range(DestIp, prefix1), DestIpIsInPrefix2 = ipv4_is_in_range(DestIp, prefix2)
    | where (SrcIpIsInPrefix1 and DestIpIsInPrefix2) or (SrcIpIsInPrefix2 and DestIpIsInPrefix1)
    | extend Direction = iff((SrcIpIsInPrefix1 and DestIpIsInPrefix2), "Onprem2Azure", "Azure2Onprem")
    | summarize TotalBytesSrcToDest=sum(BytesSrcToDest), TotalBytesDestToSrc=sum(BytesDestToSrc) by Direction
    | render columnchart

    1. Hey there! The query looks legit, although where are you capturing? If in the GatewaySubnet, consider that you are never going to see Azure2onprem traffic, and onprem2Azure only if FastPath is not enabled.

      1. generouslyc1d4796538

        thank you so much for quick response.

        I have enabled flow logs on all spokes (app & db,) and hub as well. Just wondering if prefix1 prefix2 should be the spoke supernet. Also hoping to spit or show calls going to on prem and that latency is inevitable due to west n east distance. But since there logging is not in plc trying to see if vnet flow logs can help show those calls being made.

  6. John

    Is there a way to see replying traffic in vnet logs? I know for sure the traffic exists because i see it in the VM, but doesn’t matter which query i use i never see this traffic as inbound from this source public IP.

    1. Hey John, thanks for reading! Not sure what you mean with “replying traffic”. If you mean return packets in the same TCP connection, such as for example in a HTTP GET request, the traffic volume is stored in the fields BytesDestToSrc and PacketsDestToSrc.

      If by “replying traffic” you mean that the destination initiates a new TCP connection, as is the case in active FTP, only then will you see a new flow entry.

      Does that make sense?

      1. John

        Hi Erjosito, thanks a lot for getting back to me. I’m seeing SIP flows between my VM and the service provider, but no matter how I query the VNet flow logs, I only ever see outbound traffic to them, never any inbound traffic. My assumption is that since the VM initiates the connection, all subsequent communication uses that same connection, and the VNet flow logs categorize it as outbound. My question is: is there any way to view this traffic as inbound, reflecting how it actually behaves?

      2. Not sure what you mean with “how it actually behaves”. If the VM initiates the connection, then it is an outbound flow, period.

      3. John

        Well, the VM sends SIP OPTIONS packet, provider replies with 200 OK packet. This never appears in vnet flow logs, but i clearly can see those in the VM itself.

      4. My point is that you should look at VNet Flow Logs as exactly that: logs for TCP flows. So if the SIP TCP flows are initiated by the VM, all of the flow records will have the VM as source, and you shouldn’t expect that it shows up as a destination.

  7. John

    I see, that makes sense, thank you.
