VNet Flow Logs Recipes (part 2): fine-tune your security rules

Right when VNet Flow Logs were launched, I blogged about some recipes that help extract insights from the different fields contained in the flow logs. After working with VNet Flow Logs and Traffic Analytics for a while now, I thought I could share some additional tips and tricks, this time focusing on the enrichments that Traffic Analytics brings to the table, which will help you fine-tune your security rules.

Especially if you come from on-premises networking, your network security policy might be quite coarse: everything is controlled by a central firewall that has no visibility into certain flows, such as traffic between systems in the same subnet. However, public cloud lets you apply much more granular access controls in the form of Network Security Groups and Security Admin Rules. Before you start making your security policy more granular, you need to know which traffic you should allow and which traffic you should drop. Unless you have exceptionally good application documentation, you are going to have to rely on traffic visibility to find this out.

Most of the recipes that I will cover in this post are targeted at this goal: either identifying traffic that is allowed but shouldn't be, or highlighting flows that you should permit in your rules for your applications to work properly. For example, I am going to show you how to identify flows going to or coming from the public Internet which you might not want, as well as how to flag flows that have been identified by Traffic Analytics as malicious and have been allowed into your network, which might indicate a breach into the organization. Especially when looking at malicious traffic that can potentially take over local machines, you need to make sure that these breaches are contained, so it is critical to verify whether this malicious traffic is targeting single virtual machines or whether it extends to whole subnets or even virtual networks.

The cherry on the cake of this blog post will be how to enrich Traffic Analytics information with Fully Qualified Domain Names (FQDN) coming from Azure DNS Security Policy logs, so that you don't have to guess what a given public IP stands for.

Which traffic should I allow or deny to my workloads?

Typically you would align your application workload types to your subnets. For example, you would have a dedicated subnet for the production databases of each application, another one for the frontend servers, and so forth. This allows you to specify the security rules for that specific workload in a Network Security Group (NSG) that is applied at the subnet level. Note that even if the NSG is applied at the subnet level, it is still enforced at each NIC, allowing micro-segmentation or filtering of intra-subnet traffic.

NTANetAnalytics
| where TimeGenerated > ago(60d)
// Optionally filter for TCP/UDP ports below 1024
//| where DestPort < 1024
// Optionally filter by denied/allowed flows
| extend AllowedFlows = AllowedInFlows + AllowedOutFlows
| extend DeniedFlows = DeniedInFlows + DeniedOutFlows
//| where DeniedFlows > 0
//| where AllowedFlows > 0
// Classify Src/Dest in private/public and optionally filter
| extend DestType=iff(isnotempty(DestSubnet), "Private", "Public")
| extend SrcType=iff(isnotempty(SrcSubnet), "Private", "Public")
//| where SrcType == "Private" and DestType == "Private"
// Coalesce Source and Destination
| extend Source = tostring(coalesce(SrcSubnet, SrcType))
| extend Destination = tostring(coalesce(DestSubnet, DestType))
| where isnotempty(Destination) and isnotempty(Source)
// Put together protocol and destination port
| extend App = strcat(L4Protocol, DestPort)
// Summarize
| summarize AllowedFlows=sum(AllowedFlows), DeniedFlows=sum(DeniedFlows), TransferredBytes=sum(BytesSrcToDest+BytesDestToSrc) by Source, Destination, App
| order by TransferredBytes desc

This is what it looks like in my lab:

This would give you an idea of whether you should add new deny rules to block specific flow types. For example, from the table above it looks like there is some allowed unencrypted traffic from some subnets outbound to the public Internet, which we might want to block with NSGs or security admin rules.

There are many ways to pivot around this information. For example, you could look for internal traffic between subnets, either allowed or denied, since this might indicate either lateral movement attempts or legitimate traffic. You could use a variation of the previous query:

NTANetAnalytics
| where TimeGenerated > ago(60d)
// Optionally filter for TCP/UDP ports below 1024
//| where DestPort < 1024
// Optionally filter by denied/allowed flows
| extend AllowedFlows = AllowedInFlows + AllowedOutFlows
| extend DeniedFlows = DeniedInFlows + DeniedOutFlows
//| where DeniedFlows > 0
//| where AllowedFlows > 0
// Classify Src/Dest in private/public and optionally filter
| extend DestType=iff(isnotempty(DestSubnet), "Private", "Public")
| extend SrcType=iff(isnotempty(SrcSubnet), "Private", "Public")
| where SrcType == "Private" and DestType == "Private"
// Coalesce Source and Destination
| extend Source = tostring(coalesce(SrcSubnet, SrcType)), Destination = tostring(coalesce(DestSubnet, DestType))
| where isnotempty(Destination) and isnotempty(Source)
// Put together protocol and destination port
| extend App = strcat(L4Protocol, DestPort)
// Summarize
| summarize AllowedFlows=sum(AllowedFlows), DeniedFlows=sum(DeniedFlows), TransferredBytes=sum(BytesSrcToDest+BytesDestToSrc) by Source, Destination, App
| order by TransferredBytes desc

And here is a sample result:

This output shows me that I have very little traffic between my subnets. It is essentially only DNS traffic from each subnet to the Azure Firewall, since it is configured as DNS proxy. This is telling me that I should probably create an NSG that only allows UDP port 53 between subnets.

Not everybody aligns subnets to workload types though. Especially if your app is made up of many small services, you might deploy them in the same subnet. There is a whole different discussion on how to approach this: still with subnet-level NSGs that cover all of the microservices in the subnet, with NIC-level NSGs that are microservice-specific, using Application Security Groups to make things a little bit easier… This is outside of the scope of this post, but suffice it to say that you will still need to identify which traffic is required for each of your microservices.

Cross-boundary traffic

You could take the previous example of subnet summaries further and group at other levels of the hierarchy. Among the additional fields with which Traffic Analytics enriches the information from VNet Flow Logs you can find metadata of the sending and receiving virtual machines, such as their region, resource group, virtual network and subnet. It might be interesting to summarize the traffic that crosses your network boundaries. For example, you might think that you do not have much traffic going across regions, so you could use this query, which gives you intra- and inter-regional traffic, to confirm your hypothesis:

NTANetAnalytics
| where TimeGenerated > ago(60d)
| project SrcRegion, DestRegion, BytesDestToSrc, BytesSrcToDest
| where isnotempty(SrcRegion) and isnotempty(DestRegion)
| summarize TransferredBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcRegion, DestRegion

In my particular case it looks like I only have intra-region traffic in the East US Azure region, which is a good thing, proving that I don’t have any cross-region traffic:

You could perform a similar analysis pivoting on other fields, such as source and destination subscriptions, resource groups, subnets, VNets, etc. Here you have an example using subscriptions, to highlight intra- and inter-subscription traffic:

NTANetAnalytics
| where TimeGenerated > ago(60d)
| project SrcSubscription, DestSubscription, BytesDestToSrc, BytesSrcToDest
| where isnotempty(SrcSubscription) and isnotempty(DestSubscription)
| summarize TransferredBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcSubscription, DestSubscription

Break down per FlowType

Traffic Analytics does a first classification of the ingested flows that you can use for further investigation. With this simple query you can get a first view of how the flows to and from your system are distributed across different categories:

NTANetAnalytics
| where TimeGenerated > ago(30d)
| where SubType=="FlowLog"
| summarize RecordNumber=count() by FlowType
| render piechart

Here is a sample representation of my test data, where you can see the massive amount of external flows as compared to the rest, which might already raise some alarm bells:

In the Notes section of the Traffic Analytics schema doc you can see a definition of what each flow type means.

Threat type

The first flow type you are probably going to want to look at is MaliciousFlow. Traffic Analytics offers you additional information in the form of the ThreatType and ThreatDescription fields that you can leverage. An easy query on the ThreatType field already gives us good information:

NTAIpDetails
| where TimeGenerated > ago(30d)
| where FlowType == "MaliciousFlow"
| summarize count() by ThreatType
| render piechart

This is what it looks like in my lab:

Malicious flows detailed info

Of course we can dig deeper into these malicious flows. For example, it is important to distinguish between inbound and outbound malicious flows. You can do that and much more with the techniques in this query (thanks to Niti Gupta for this one!):

// Variable to make time filter easier
let lookback = ago(30d);
// Get the malicious IP in a table for cross-referencing later on
let MaliciousIps = NTAIpDetails 
    | where TimeGenerated > lookback
    | where FlowType == "MaliciousFlow"
    | project Ip, PublicIpDetails, ThreatType, ThreatDescription, Location, Url, DnsDomain
    | distinct *;
// Get the main fields out of flow logs marked as malicious
let MaliciousFlows =  NTANetAnalytics 
    | where TimeGenerated > lookback
    | where SubType == "FlowLog" and FlowType == "MaliciousFlow"
    | project SrcIp, DestIp, SrcVm, DestVm, SrcSubscription, DestSubscription; 
// Get the subset of the malicious flows where the **src** IP is the malicious one
let SrcMalicious = MaliciousFlows 
    | lookup kind=inner MaliciousIps on $left.SrcIp == $right.Ip
    | extend CompromisedVM = iff(isnotempty(DestVm), strcat("/subscriptions/", DestSubscription, "/resourceGroups/", tostring(split(DestVm, "/")[0]), "/providers/Microsoft.Compute/virtualMachines/", tostring(split(DestVm, "/")[1])), '')
    | project
        MaliciousIp = strcat('🌐 ', SrcIp),
        CompromisedIp = strcat('🖥️ ', DestIp),
        CompromisedVM, PublicIpDetails, ThreatType, DnsDomain, ThreatDescription, Location, Url;
// Get the subset of the malicious flows where the **dst** IP is the malicious one
let DestMalicious = MaliciousFlows 
    | lookup kind=inner MaliciousIps on $left.DestIp == $right.Ip 
    | extend CompromisedVM = iff(isnotempty(SrcVm), strcat("/subscriptions/", SrcSubscription, "/resourceGroups/", tostring(split(SrcVm, "/")[0]), "/providers/Microsoft.Compute/virtualMachines/", tostring(split(SrcVm, "/")[1])), '')
    | project
        CompromisedIp = strcat('🖥️ ', SrcIp),
        MaliciousIp = strcat('🌐 ', DestIp),
        CompromisedVM, PublicIpDetails, ThreatType, DnsDomain, ThreatDescription, Location, Url;
// Put both together
SrcMalicious | union DestMalicious

Example from my lab:

You can use this table for multiple use cases: for example, you can use the Location field to represent the records on a map (see my post on VNet Flow Logs and Grafana for an example of that), or you can use this table to identify compromised IPs and even lateral movement of the attackers.

As a potential extension of the previous query, you could add the fields AllowedInFlows, AllowedOutFlows, DeniedInFlows and DeniedOutFlows to figure out whether these flows have been allowed or denied.

Extracting public IPs for external flows

As described in the Traffic Analytics schema documentation, public IP addresses are consolidated into a series of records like this:

57.150.182.65|198|196|1215|1823|292190|1803234 20.150.82.228|4|4|33|31|7960|22256 20.190.151.68|1|2|40|37|15470|25652

It is a list of space-separated values, where each value is again a pipe-separated list containing the public IP address, the flow started and flow ended counts, and the outbound/inbound packets and bytes. You can use the KQL split function twice (once with the separator “ ” and once with the separator “|”) to retrieve the public IP address. If you are interested in the packets or bytes, you can of course use a variation of this to get those values (you would be interested in the last four values of each record, which are outbound/inbound packets and outbound/inbound bytes, in this order).
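To make the double split more tangible, here is a minimal sketch that you can paste into any Log Analytics query window, since it does not depend on your own data. It takes the sample value shown above as a one-row datatable and breaks it down into one row per public IP. The column names in the final projection (FlowsStarted, OutboundPackets and so on) are my own labels for this illustration, not official schema names.

```kusto
// Sample value copied from the text above; output column names are illustrative only
datatable(DestPublicIps: string) [
    "57.150.182.65|198|196|1215|1823|292190|1803234 20.150.82.228|4|4|33|31|7960|22256 20.190.151.68|1|2|40|37|15470|25652"
]
// First split: one array element per public IP record
| extend Record = split(DestPublicIps, " ")
| mv-expand Record to typeof(string)
// Second split: the pipe-separated fields inside each record
| extend Fields = split(Record, "|")
| project
    PublicIp        = tostring(Fields[0]),
    FlowsStarted    = tolong(Fields[1]),
    FlowsEnded      = tolong(Fields[2]),
    OutboundPackets = tolong(Fields[3]),
    InboundPackets  = tolong(Fields[4]),
    OutboundBytes   = tolong(Fields[5]),
    InboundBytes    = tolong(Fields[6])
```

The result should contain three rows, one for each IP address in the sample string.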

// Simplified version of NTAIpDetails with fewer fields
let IpDetailsSubset = NTAIpDetails
| project Ip, Location, PublicIpDetails;
// Main query
NTANetAnalytics
| where TimeGenerated > ago(30d)
| where SubType == "FlowLog" and FlowType in ("AzurePublic", "ExternalPublic")
| where isnotempty(DestPublicIps) or isnotempty(SrcPublicIps)
| project TimeGenerated, SrcIp, SrcPublicIps, DestIp, DestPublicIps, FlowType
// Standard format for DestPublicIp
| extend DestPublicIpsList = split(DestPublicIps, ' ')
| mv-expand DestPublicIpsList
| extend DestIp = iff(isempty(DestIp), tostring(split(DestPublicIpsList, '|')[0]), DestIp)
| project-away DestPublicIpsList, DestPublicIps
// Create "Direction" field before SrcIp is backfilled (an empty SrcIp means the source is a public IP)
| extend Direction = iff(isempty(SrcIp), 'Inbound', 'Outbound')
// Standard format for SrcPublicIp
| extend SrcPublicIpsList = split(SrcPublicIps, ' ')
| mv-expand SrcPublicIpsList
| extend SrcIp = iff(isempty(SrcIp), tostring(split(SrcPublicIpsList, '|')[0]), SrcIp)
| project-away SrcPublicIpsList, SrcPublicIps
// Enrich with NTAIpDetails, both for src and dest
| lookup kind=leftouter IpDetailsSubset on $left.DestIp==$right.Ip
| lookup kind=leftouter IpDetailsSubset on $left.SrcIp==$right.Ip
// Consolidate Location and IPinfo
| extend Location = coalesce(Location, Location1) | project-away Location1
| extend PublicIpDetails = coalesce(PublicIpDetails, PublicIpDetails1) | project-away PublicIpDetails1

The coalesce function is just a way of getting the first non-empty value out of a list, so that we can consolidate the output of the two lookups into a single field.
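If you want to see this behavior in isolation, the following sketch mimics the two columns that result from the double lookup with a small datatable. The values are made up for the illustration; the only point is that for string columns coalesce skips empty values.

```kusto
// Made-up values that mimic the output of the two lookups:
// row 1: only the first lookup matched
// row 2: only the second lookup matched
// row 3: neither lookup matched
datatable(Location: string, Location1: string) [
    "US", "",
    "",   "NL",
    "",   ""
]
| extend Consolidated = coalesce(Location, Location1)
```

The Consolidated column should carry the value from whichever lookup matched, and stay empty when neither did.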

Example from my lab:

This kind of query can be used for geographic representations or to highlight flows that Traffic Analytics has not marked as malicious, but that could be going to countries such as North Korea with which your organization has no business.

Another use case for this query is to detect traffic to Azure Cloud’s public IP addresses that should be using Private Link instead.

Focus on allowed malicious traffic

The previous queries on malicious or public traffic do not tell you whether that traffic was allowed or not, but it is easy to include this information. For example, starting from the query for the malicious flows we can add a couple more fields that show whether the flows were allowed or denied, and which NSG or security admin rule was responsible for it. A sprinkle of summarization and rendering will do the rest:

// Variable to make time filter easier
let lookback = ago(30d);
// Get the malicious IP in a table for cross-referencing later on
let MaliciousIps = NTAIpDetails 
    | where TimeGenerated > lookback
    | where FlowType == "MaliciousFlow"
    | project Ip, PublicIpDetails, ThreatType, ThreatDescription, Location, Url, DnsDomain
    | distinct *;
// Get the main fields out of flow logs marked as malicious
let MaliciousFlows =  NTANetAnalytics 
    | where TimeGenerated > lookback
    | where SubType == "FlowLog" and FlowType == "MaliciousFlow"
    | project SrcIp, DestIp, SrcVm, DestVm, SrcSubscription, DestSubscription, AclGroup, AclRule, AllowedInFlows, AllowedOutFlows, DeniedInFlows, DeniedOutFlows;
// Get the subset of the malicious flows where the **src** IP is the malicious one
let SrcMalicious = MaliciousFlows 
    | lookup kind=inner MaliciousIps on $left.SrcIp == $right.Ip
    | extend ImpactedVM = iff(isnotempty(DestVm), strcat("/subscriptions/", DestSubscription, "/resourceGroups/", tostring(split(DestVm, "/")[0]), "/providers/Microsoft.Compute/virtualMachines/", tostring(split(DestVm, "/")[1])), ''), Direction="Inbound"
    | extend AllowedFlows = AllowedInFlows, DeniedFlows=DeniedInFlows
    | project
        MaliciousIp = strcat('🌐 ', SrcIp),
        ImpactedLocalIp = strcat('🖥️ ', DestIp),
        ImpactedVM, PublicIpDetails, ThreatType, DnsDomain, ThreatDescription, Location, Url, Direction, SrcVm, DestVm,
        AclGroup, AclRule, AllowedFlows, DeniedFlows;
// Get the subset of the malicious flows where the **dst** IP is the malicious one
let DestMalicious = MaliciousFlows 
    | lookup kind=inner MaliciousIps on $left.DestIp == $right.Ip 
    | extend ImpactedVM = iff(isnotempty(SrcVm), strcat("/subscriptions/", SrcSubscription, "/resourceGroups/", tostring(split(SrcVm, "/")[0]), "/providers/Microsoft.Compute/virtualMachines/", tostring(split(SrcVm, "/")[1])), ''), Direction="Outbound"
    | extend AllowedFlows = AllowedOutFlows, DeniedFlows=DeniedOutFlows
    | project
        ImpactedLocalIp = strcat('🖥️ ', SrcIp),
        MaliciousIp = strcat('🌐 ', DestIp),
        ImpactedVM, PublicIpDetails, ThreatType, DnsDomain, ThreatDescription, Location, Url, Direction, SrcVm, DestVm,
        AclGroup, AclRule, AllowedFlows, DeniedFlows;
// Put both together
SrcMalicious | union DestMalicious
| where AllowedFlows > 0
| summarize MaliciousFlows=count() by AclRule
| render columnchart

So it looks like our NSG rule “allowsshin” is responsible for letting in most of the malicious traffic. Maybe somebody ought to have a look at it and make it more restrictive?

Enrich with DNS information

One of the drawbacks of collecting information on the wire is that you can only see IP addresses, not Fully Qualified Domain Names. And yet, FQDNs can deliver very valuable information about what is actually represented by IP addresses. By the way, thanks to Abhishek Sharma for the idea on this one!

A first approach you might consider is using Azure Firewall DNS logs, assuming you are using Azure Firewall as DNS proxy. Nice try, but both legacy (log category AzureFirewallDnsProxy) and structured (log category AZFWDnsQuery) logs for Azure Firewall DNS queries do not store what the DNS response was, so these logs are pretty useless for our goal.

However, there is a new kid on the block: DNS Security Policies. This is a set of security rules that you can attach to a VNet, so that DNS requests are processed according to those rules. But more interesting for our use case, DNS Security Policies support Diagnostic Settings to send logs to a Log Analytics workspace, and these logs do include the resolved IP address!
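Before joining these logs with the flow data, it helps to see how a single DNS log record unfolds into FQDN/IP pairs. The sketch below uses an invented record in a datatable: the domain names and the IP addresses (taken from a documentation range) are made up, and I am assuming that the Answer column is a dynamic array whose elements carry Type and RData properties, which are the property names used by the query in this post.

```kusto
// Invented DNS log record: one CNAME plus two A records in the answer
datatable(QueryName: string, Answer: dynamic) [
    "www.example.com.", dynamic([
        {"Type": "CNAME", "RData": "cdn.example.net."},
        {"Type": "A",     "RData": "203.0.113.10"},
        {"Type": "A",     "RData": "203.0.113.11"}
    ])
]
// One row per element of the answer
| mv-expand Answer
// Keep only the IPv4 address records
| where tostring(Answer.Type) == "A"
| project FQDN = QueryName, IPAddress = tostring(Answer.RData)
```

This should return two rows, both pointing back to the name that was originally queried, which is exactly what we need to label a public IP address with the FQDN that a workload asked for.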

Hence we can enrich the information related to the public IP addresses even more:

// Outbound traffic
// Simplified version of NTAIpDetails with fewer fields
let IpDetailsSubset = NTAIpDetails
| project Ip, Location, PublicIpDetails;
// DNS logs
let DNSlogs = DNSQueryLogs
| project QueryName, Answer
| mv-expand Answer
| where Answer.Type == "A"
| project FQDN=QueryName, IPAddress = tostring(Answer.RData);
// Main query
NTANetAnalytics
| where TimeGenerated > ago(30d)
| where SubType == "FlowLog" and FlowType in ("AzurePublic", "ExternalPublic")
| where isnotempty(DestPublicIps)
// Optional: filter only for flows that have not been denied
// Intuitively you would filter by AllowedOutFlows>0, but there are some records for which all Allowed/Denied counters are zero
| where DeniedOutFlows == 0
| project TimeGenerated, SrcIp, DestIp, DestPublicIps, FlowType
// Standard format for DestPublicIp
| extend DestPublicIpsList = split(DestPublicIps, ' ')
| mv-expand DestPublicIpsList
| extend DestIp = iff(isempty(DestIp), tostring(split(DestPublicIpsList, '|')[0]), DestIp)
| project-away DestPublicIpsList, DestPublicIps
// Enrich with NTAIpDetails
| lookup kind=leftouter IpDetailsSubset on $left.DestIp==$right.Ip
// Enrich with DNS resolution
| lookup kind=leftouter DNSlogs on $left.DestIp==$right.IPAddress
// Consolidate eliminating the timestamp
| distinct SrcIp, DestIp, FlowType, Location, PublicIpDetails, FQDN
| where isnotempty(FQDN)

And an example from my test bed (note the FQDN column at the far right):

As you can see, the FQDN field (obtained by correlating with the DNS logs) offers much more information than the PublicIpDetails field of the NTAIpDetails table. Even if you might think that the flows going out to Amazon Data Services could be a data exfiltration attempt, it turns out that it is probably a misconfiguration where some machines are still using the default Message Of The Day (MOTD) of Ubuntu, which by the way can be disabled via configuration. This way you can purposefully refine your security posture by eliminating or blocking undesired flows.

Break down per VNet and subnet

It might be interesting to break down the malicious traffic or the external traffic obtained by the previous queries per VNet or subnet. The reason is that if a virtual machine has been compromised, the attacker might have moved laterally and infected other systems in the same subnet or virtual network, so you might have to quarantine whole segments. Or if it is a misconfiguration as the previous example indicates, you might want to contact the subnet owner instead of sending one message for every affected virtual machine (my laziness is driving a lot of my pet projects).

We can take the previous query for the allowed malicious flows and expand it to include the “ImpactedSubnet” (either the source or the destination subnet, depending on the direction of the flow), and summarize by this field (in this example I am keeping the top 10 country-subnet pairs with the most records):

// Variable to make time filter easier
let lookback = ago(30d);
// Get the malicious IP in a table for cross-referencing later on
let MaliciousIps = NTAIpDetails 
    | where TimeGenerated > lookback
    | where FlowType == "MaliciousFlow"
    | project Ip, PublicIpDetails, ThreatType, ThreatDescription, Location, Url, DnsDomain
    | distinct *;
// Get the main fields out of flow logs marked as malicious
let MaliciousFlows =  NTANetAnalytics 
    | where TimeGenerated > lookback
    | where SubType == "FlowLog" and FlowType == "MaliciousFlow"
    | project SrcIp, DestIp, SrcVm, DestVm, SrcSubscription, DestSubscription, SrcSubnet, DestSubnet, BytesSrcToDest, BytesDestToSrc, AclGroup, AclRule, AllowedInFlows, AllowedOutFlows, DeniedInFlows, DeniedOutFlows;
// Get the subset of the malicious flows where the **src** IP is the malicious one
let SrcMalicious = MaliciousFlows 
    | lookup kind=inner MaliciousIps on $left.SrcIp == $right.Ip
    | extend ImpactedVM = iff(isnotempty(DestVm), strcat("/subscriptions/", DestSubscription, "/resourceGroups/", tostring(split(DestVm, "/")[0]), "/providers/Microsoft.Compute/virtualMachines/", tostring(split(DestVm, "/")[1])), ''), Direction="Inbound"
    | extend ImpactedSubnet = tostring(split(DestSubnet, "/")[2])
    | extend AllowedFlows = AllowedInFlows, DeniedFlows=DeniedInFlows
    | project
        MaliciousIp = strcat('🌐 ', SrcIp),
        ImpactedLocalIp = strcat('🖥️ ', DestIp), ImpactedSubnet,
        ImpactedVM, PublicIpDetails, ThreatType, DnsDomain, ThreatDescription, Location, Url, Direction, SrcVm, DestVm, BytesSrcToDest, BytesDestToSrc,
        AclGroup, AclRule, AllowedFlows, DeniedFlows;
// Get the subset of the malicious flows where the **dst** IP is the malicious one
let DestMalicious = MaliciousFlows 
    | lookup kind=inner MaliciousIps on $left.DestIp == $right.Ip 
    | extend ImpactedVM = iff(isnotempty(SrcVm), strcat("/subscriptions/", SrcSubscription, "/resourceGroups/", tostring(split(SrcVm, "/")[0]), "/providers/Microsoft.Compute/virtualMachines/", tostring(split(SrcVm, "/")[1])), ''), Direction="Outbound"
    | extend ImpactedSubnet = tostring(split(SrcSubnet, "/")[2])
    | extend AllowedFlows = AllowedOutFlows, DeniedFlows=DeniedOutFlows
    | project
        ImpactedLocalIp = strcat('🖥️ ', SrcIp), ImpactedSubnet,
        MaliciousIp = strcat('🌐 ', DestIp),
        ImpactedVM, PublicIpDetails, ThreatType, DnsDomain, ThreatDescription, Location, Url, Direction, SrcVm, DestVm, BytesSrcToDest, BytesDestToSrc,
        AclGroup, AclRule, AllowedFlows, DeniedFlows;
// Put both together
SrcMalicious | union DestMalicious
| where AllowedFlows > 0
| summarize MaliciousFlows=count() by Location, ImpactedSubnet
| top 10 by MaliciousFlows
| render barchart

The previous query uses a field called “ImpactedSubnet”, but you can use the same logic to refer to an “ImpactedVNet”.
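As a minimal sketch of that variation: the query above takes the third element of the slash-separated subnet field as the subnet name, so I am assuming here that the second element is the VNet name and the first one the resource group. The sample value below is hypothetical.

```kusto
// Hypothetical subnet identifier, assumed to follow the "resourceGroup/vnet/subnet" form
datatable(DestSubnet: string) [
    "myrg/myvnet/vm0"
]
| extend Parts = split(DestSubnet, "/")
| extend ImpactedVNet   = tostring(Parts[1]),
         ImpactedSubnet = tostring(Parts[2])
| project-away Parts
```

In the full query you would add the ImpactedVNet field to both the inbound and the outbound branch (using DestSubnet and SrcSubnet respectively) and then summarize by it instead of ImpactedSubnet.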

The native Log Analytics query editor does not have as many visualizations as Azure Monitor Workbooks, but you can still get an idea with the bar chart rendering:

What is the previous representation telling us? Most of the allowed malicious traffic in our network is coming from China, and it is mostly impacting the “vm0” and “vm1” subnets.

Conclusion

I hope you got some ideas on how to leverage VNet Flow Logs to gain insights into your traffic and how to fine-tune your security policies to improve the posture of your Azure footprint. If you have other cool queries, please let me know in the comments!