Deploy (Azure) Network-as-Code as a champ

Virtually every expert out there recommends following an Infrastructure-as-Code approach to manage Azure networks, even more so when dealing with traffic segmentation features such as firewall rulesets and Network Security Groups (those tend to change more frequently than other resources). And yet, there is surprisingly little guidance on how to actually do it, or on the challenges and difficulties that can (and will) come up along the way.

So I decided to give this a try, and I created a GitHub repo to store Azure networking configurations for Azure Firewall and Network Security Groups: https://github.com/erjosito/segmentation-iac. Spoiler alert: creating an ARM, bicep or Terraform template is the easiest bit; the hard work comes after that. If you want to learn more, keep reading!

Separation of concerns

Your first challenge is how to structure your IaC configuration. Yes, you could create a single template that deploys everything, but that wouldn’t be too manageable, would it? Whenever you make the slightest change, you would need to edit the whole file, risking breaking everything. Additionally, you want to contain the blast radius of changes: screwing something up in a test environment shouldn’t impact production. Not to mention that the workflow deploying this monster template would need access to pretty much every subscription in your setup.

So you will quickly come to the conclusion that you need multiple templates, split into separate files or modules, so that you can assign different ownerships (learn about GitHub’s code owners feature). Moreover, those files might even reside in different repositories, if your organization follows a multirepo strategy. Many articles exist out there about monorepo vs multirepo architectures, but in my opinion multirepo is clearly winning the race. More on multirepo and crossing repository boundaries further down.

Bicep beats ARM, every day of the week

So far I hadn’t found the killer feature of bicep that would make me switch from ARM. I like the fact that ARM is based on JSON, so I can easily create ARM templates programmatically with Python (see for example my scripts to convert third-party firewall policies to Azure Firewall ARM templates here). Hence, I started my IaC repo with ARM, but the previous principle of “Separation of concerns” quickly convinced me of bicep’s superiority:

  • Bicep modules can be referenced by relative file paths, which makes working in branches much easier. ARM nested templates, in contrast, are referenced by URLs, which makes handling them much more cumbersome.
  • You can import information from files with functions such as loadJsonContent and loadYamlContent. For example, you can have different teams each working on the firewall rules for their own applications, and then merge all of those into a single bicep template (see the sketch below). ARM completely lacks these file-based functions.
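
To illustrate both points, here is a minimal sketch (with hypothetical file and parameter names, not taken from the repo) of how relative module paths and loadJsonContent work together:

// Hypothetical sketch: each team maintains its rules in its own JSON file
// (assumed to contain an array of rules), and the parent template merges
// them at build time with loadJsonContent()
var team1Rules = loadJsonContent('./rules/team1-rules.json')
var team2Rules = loadJsonContent('./rules/team2-rules.json')

// Relative module path: resolves within your branch, no URL required
module firewallPolicy './modules/azfw-policy.bicep' = {
  name: 'azfw-policy-deploy'
  params: {
    rules: concat(team1Rules, team2Rules)
  }
}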

Azure Firewall Rule Collection Groups

When separating “stuff” into different files, you need to know what that “stuff” actually is. Let me give you an example: if you want to split an Azure Firewall Policy across different modules, you can use Rule Collection Groups (RCGs). This is a great separation boundary, because you can deploy a single rule collection group in its own ARM/bicep/Terraform template, without having to deploy the whole Azure Firewall policy.

There is one problem though: as documented in the Azure Firewall Limits, an Azure Firewall Policy supports a maximum of 60 RCGs. If you can get away with partitioning your firewall policy into 60 chunks or fewer, you are good! For example, if you have 20 applications, and every application has 2 environments (test and production), then you would only need 40 RCGs, and a change in a given app environment would only trigger a deployment for its own RCG.

If you have more than that, as is the case in many large organizations running hundreds of applications, your RCG becomes more of a group of applications, a “Line of Business”. You can still use Rule Collections (RCs) for each app environment, but whenever there is a change in one of the RCs, you need to merge all RCs into one RCG template and deploy it. You can see an example of such a merge process in rcg-app03.bicep, which merges 4 rule collections, each loaded from a separate JSON file, into a single bicep template (a simplified sketch follows below).
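
As a rough sketch of what such a merge can look like (hypothetical file and resource names, assuming each JSON file contains one complete rule collection object; the real thing is in rcg-app03.bicep):

param policyName string

resource fwPolicy 'Microsoft.Network/firewallPolicies@2022-07-01' existing = {
  name: policyName
}

// One RCG per Line of Business, merging the per-environment rule collections
resource rcgApp 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2022-07-01' = {
  parent: fwPolicy
  name: 'rcg-app03'
  properties: {
    priority: 1000
    ruleCollections: [
      loadJsonContent('./rules/app03-dev.json')
      loadJsonContent('./rules/app03-qa.json')
      loadJsonContent('./rules/app03-staging.json')
      loadJsonContent('./rules/app03-prod.json')
    ]
  }
}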

In my example I split the Azure Firewall policy into 5 RCGs: a shared one, with common rules enforced by central IT, plus one RCG per application (there are four apps in the repo, named app01, app02, app03 and app04). Each application might introduce further separation: app03 consists of four environments (dev, QA, staging and prod), and app04 has two environments (test and prod).

GitHub Actions All The Things!

Application teams should own their code, including the information required to deploy the NSG and firewall rules that their applications need to work. The time of a single firewall admin (or team) controlling all the rules, without really understanding what those rules do or whether they are still required, is long over.

And yet, app admins shouldn’t be expected to be network experts. Hence, some guardrails are required to detect and correct mistakes before deploying. Welcome to GitHub Actions.

If you protect your code branch, pull requests become mandatory before merging code changes. This makes it possible for one or more code owners to review the code, and for checks to be performed on the changes, to make sure that the new configuration is valid and compliant.

These checks can be implemented with GitHub actions, which you can consume from the GitHub Marketplace. For example, you can use the azure/arm-deploy action in Validate deployment mode to verify that your templates are syntactically correct:

      - uses: azure/arm-deploy@v1
        name: Run preflight validation for shared infra
        with:
          resourceGroupName: ${{ secrets.AZURE_RG }}
          template: ./shared/bicep/azfwpolicy.bicep
          deploymentMode: Validate

However, you can also code GitHub actions yourself to verify the semantics of your templates (that they do what they are supposed to do), which is actually surprisingly easy: for example, the cidr_prefix_length_bicep action in the repo verifies that admins are not using overly permissive network masks. You can see the 5 files required to create a custom action:

  • An action.yml file that contains the inputs and outputs
  • A README.md markdown file with some instructions on how to use the action
  • A Dockerfile, since GitHub will run this as a container. The Dockerfile is pretty simple: it uses a Python image, installs some Python modules, and runs the Python file that performs the check
  • A requirements.txt file, with the required libraries. The only thing I need is pycep, a parser to read the contents of bicep templates into python objects.
  • And finally, the heart of the action: entrypoint.py. It is a simple script that errors out whenever the condition is not satisfied; in this example, whenever it finds a rule using a network prefix shorter than /24.

You can now create a workflow that runs on every pull request and verifies that the bicep code follows your rules: check_bicep_code.yml.

You can use GitHub actions for many things. I have some other examples, such as ipgroups_max_bicep, which checks that the total number of IP Groups doesn’t exceed a given number (the limit for an Azure Firewall Policy is 100), or fw_rule_prio_bicep, which checks that the priorities of the firewall rule collection groups and rule collections defined by app admins fall within a given range. I am sure your organization will have many other rules that need to be followed, and as long as they are based on code, you can enforce them with a little bit of Python and a GitHub action.

Using concurrency groups

You want to avoid concurrent deployment operations, since some resources in Azure do not like that too much (in the best case, one of the operations will be rejected; in the worst case, you could end up with resources in a Failed state). Luckily, GitHub offers concurrency groups to make sure that deployments are always sequential. You can check how this feature is used in deploy_azfw_bicep.yml:

jobs:
  deploy-azfw-policy:
    concurrency:
      group: deploy_azfw

At any given point in time, only one deployment belonging to the deploy_azfw concurrency group will be executed. If a second one is scheduled (for example, because two pull requests were merged in quick succession), it will wait until the first one finishes.

Scope your deployments

Even though templates are supposed to be idempotent, you want to scope your deployments. For example, when changes are introduced into a test environment, there is no need to redeploy the whole production infrastructure.

You can also scope per area: for example, in the repo I have a separate deploy_azfw_bicep.yml for the Azure Firewall policy, and different workflows for application-specific infrastructure (deploy_app01_infra.yml, deploy_app02_infra.yml, and so on).

Another reason for scoping deployments is that they will go to different resource groups and even subscriptions, and you don’t want to have a “god” credential that is able to do everything everywhere. While the workflow deploying a given app’s infrastructure only needs access to the resource group where the workload lives, the related firewall policy definitions will reside in a different, shared subscription, usually called “connectivity” or something similar (see Subscription Considerations in the Azure Landing Zone docs for more details).

There are two ways in which you can scope deployment workflows. The most obvious one is with workflow triggers. For example, looking at deploy_app01_infra.yml, you can see that it is only executed when certain files in the app01 folder are modified:

on:
  push:
    branches: [master]
    paths:
      - 'app01/bicep/nsg*.bicep'
      - 'app01/bicep/infra*.bicep'  

Depending on how you structure your files in folders, these filters might look very different. If you need tighter control, even to conditionally execute certain parts of the workflow, you can use the paths-filter action to generate boolean variables indicating whether certain files have been modified or not. You should design your templates and folders so that you don’t need this, but if you do, you can find an example of how to use it in the deploy_azfw_arm.yml workflow.

“But I don’t want to write JSON or YAML”

You might get this from some of your teams. You could let them work with a CSV file, which they can open with their favorite spreadsheet tool, and then import that CSV into your template at deployment time.

You can see an example in nsg-rules-app02.csv.

You don’t even need to open the file in Excel to get a nice representation, since GitHub renders CSV files as tables right in the browser. However, you need to transform this CSV into something useful before deploying. In the deploy_app02_infra.yml workflow you can see that this step runs before the actual deployment:

    # Expand CSV file with NSG rules to JSON
    - name: Expand CSV file with NSG rules to JSON
      run: |
        python3 ./scripts/nsg_csv_to_json.py \
           --csv-file ./app02/bicep/nsg-rules-app02.csv \
           --output-file ./app02/bicep/nsg-rules-app02.json \
           --verbose

The script nsg_csv_to_json.py contains simple code that generates a JSON file out of the CSV, which can then be imported by the bicep template nsg-app02.bicep:

var securityRules = loadJsonContent('./nsg-rules-app02.json')
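
In case you are wondering how that variable is consumed, a minimal sketch (with a hypothetical NSG resource name, assuming the generated JSON file contains an array of rule objects in the format the securityRules property expects) could look like this:

// The rules generated from the CSV plug straight into the NSG resource
var securityRules = loadJsonContent('./nsg-rules-app02.json')

resource app02Nsg 'Microsoft.Network/networkSecurityGroups@2022-07-01' = {
  name: 'app02-nsg'
  location: resourceGroup().location
  properties: {
    securityRules: securityRules
  }
}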

Triggering remote workflows

Let’s move up the complexity ladder a couple of steps. One of my sample apps, app04, is stored in its own repo https://github.com/erjosito/segmentation-iac-app04. The firewall rules required by app04 are owned and maintained by the application admins in rcg-app04.bicep, in app04’s own repo. However, app04 admins have no credentials to deploy into the connectivity subscription (where the Azure Firewall policy resides), so how does this work?

My proposed solution is the following: first, a modification in the firewall rules for app04 triggers the workflow deploy_prod_azfw_bicep.yml in app04’s repo. This workflow is actually very simple, and it has only one step:

jobs:
  Remote-AzFW-Deployment:
    runs-on: ubuntu-latest
    steps:
      - name: Dispatch triggering remote workflow
        run: |
          curl -sL \
            -X POST \
            -H "Accept: application/vnd.github+json" \
            -H "Authorization: token ${{ secrets.SHARED_REPO_ACCESS_TOKEN }}" \
            -H "X-GitHub-Api-Version: 2022-11-28" \
            https://api.github.com/repos/erjosito/segmentation-iac/actions/workflows/deploy_azfw_bicep_manual.yml/dispatches \
            -d '{"ref":"master"}'

That curl command uses a Personal Access Token to trigger the workflow deploy_azfw_bicep_manual.yml in the main repo, which does have the right credentials to deploy into the connectivity subscription. There are a couple of interesting things in this workflow. First, it includes the workflow_dispatch trigger, so that it can be started remotely via the API. Secondly, it checks out app04’s repository as well, and includes the policies defined by app04’s admins in the Azure Firewall Policy deployed to the centralized subscription:

    # Checkout code
    - name: Checkout local repo
      uses: actions/checkout@main
    # Checkout remote repo for app04
    - name: Checkout app04 repo
      uses: actions/checkout@v2
      with:
        repository: erjosito/segmentation-iac-app04
        ref: master
        path: './app04'

Shared NSG rules

A common requirement for Azure Network Security Groups is making sure that they all have some rules defined by the central IT organization, for example to block protocols that are well-known for not being secure, or to shape the NSGs in a particular way.

For example, I usually recommend adding an explicit deny rule to NSGs, and not relying on the default NSG rules. The reason is that the default rule permitting internal traffic in an NSG can easily become a “permit any to any” if there is a UDR for 0.0.0.0/0 associated with the subnet where the Virtual Machine sits.

This is not as bad as it may sound, because that 0.0.0.0/0 UDR usually sends all traffic to a firewall that will do the filtering, but that rule is still a thorn in my side. You can fix that by adding an explicit deny at the bottom of each NSG, before the default rules are run.

There are a couple of ways you can do this. And by the way, Azure Virtual Network Manager won’t help you on this one, because AVNM admin rules are executed before your other NSG rules, not after:

  • You can use Azure Policy: this great example by Guillaume Beaud deploys an RDP rule with a fixed priority for every NSG. You can easily modify it to deploy the explicit deny rule, like this.
  • Alternatively, you can augment your NSG templates with the shared rules before deploying to Azure, as you can see for example in nsg-app01.bicep, where a module fetches the shared rules and adds them to the NSG definition (a sketch of what the shared module itself might contain follows after this snippet):
// Deploy shared NSG rules
module sharedInboundRules '../../shared/bicep/nsg-shared-inbound-rules.bicep' = {
  name: 'in-rules-deploy'
  params: { 
    nsgName: app01nsg.name
  }
}
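
For reference, here is a hypothetical sketch of what such a shared-rules module could contain (the real one lives in the repo’s shared bicep folder): it references the existing NSG by name and appends the explicit deny at the lowest custom priority, 4096, so that it is evaluated right before the default rules.

param nsgName string

resource nsg 'Microsoft.Network/networkSecurityGroups@2022-07-01' existing = {
  name: nsgName
}

// Explicit deny at the lowest custom priority, evaluated just before the defaults
resource denyAllInbound 'Microsoft.Network/networkSecurityGroups/securityRules@2022-07-01' = {
  parent: nsg
  name: 'deny-all-inbound'
  properties: {
    priority: 4096
    direction: 'Inbound'
    access: 'Deny'
    protocol: '*'
    sourceAddressPrefix: '*'
    sourcePortRange: '*'
    destinationAddressPrefix: '*'
    destinationPortRange: '*'
  }
}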

Adding up

First of all, thanks for reading all the way down to here! I have different messages for you, depending on who you are:

  • If you are thinking about implementing Infrastructure-as-Code, you’d better brush up on the advanced features of your version control tooling. If using GitHub, I mentioned some in this article: protected branches, code owners, workflow triggers and custom actions are concepts you should get familiar with.
  • If you are a consultant recommending IaC to your customers, please consider the deep implications that an organization’s processes and structure will have on the implementation.
  • Lastly, if you are a vendor, please design your services so that they are IaC-friendly, which entails much more than supporting ARM and Terraform.
