Azure Machine Learning inferencing on AKS under the covers

You probably know that you can use Azure Machine Learning Services to support you along the complete life cycle of your Machine Learning development, from training to deployment. And you probably know as well that for production-grade deployments one of the best platforms to run your inferencing is Kubernetes. From the Azure Machine Learning portal (or with its CLI or Python SDK) you can deploy your model into an Azure Kubernetes Service cluster very easily, but what is really happening under the covers? This is what I am going to explore in this post.

The first thing you need to do is to create a cluster. You have two options here:

  • Letting Azure Machine Learning Services (AMLS) create your cluster
  • Creating the cluster yourself, and attaching it to AMLS later

I prefer the second option for one simple reason: if I create the cluster myself with AKS tools I can choose certain options that are not available from the AMLS creation interface. For example, I could not find how to create a cluster with the Kubernetes Cluster Autoscaler enabled with AMLS, but doing that with the AKS CLI is very easy:

subnet_id=$(az network vnet subnet show -n $aks_subnet --vnet-name $aks_vnet -g $rg --query id -o tsv)
az aks create -n $aks_name -g $rg -l $location \
              -s $vm_size --vnet-subnet-id $subnet_id \
              --network-plugin azure --generate-ssh-keys \
              --enable-cluster-autoscaler --max-count 3 --min-count 1

The previous commands create a cluster in a specific vnet and subnet (that should already exist) with cluster autoscaler enabled. After this command completes, you will have a fully functional Kubernetes cluster, but Azure Machine Learning still does not know about it. In order to tell AMLS how to deploy models to the new cluster, we need to “attach” the cluster.

You can attach the cluster using either the Azure Machine Learning portal, the Azure Machine Learning CLI or the Azure Machine Learning SDK. Since most data scientists I work with use Python to interact with AMLS, this is what I will use here as well:

from azureml.core.compute import AksCompute
from azureml.core.compute import ComputeTarget
# Use the default configuration (can also provide parameters to customize)
aks_name = 'amls'
aks_rg = 'amls'
attach_config = AksCompute.attach_configuration(resource_group = aks_rg,
                          cluster_name = aks_name,
                          cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
attach_config.enable_ssl(leaf_domain_label = "cloudtrooper")
aks_target = ComputeTarget.attach(ws, aks_name, attach_config)
aks_target.wait_for_completion(show_output=True) # This might hit ARM API throttling limits!

Easy enough! By the way, if you want to see a full-blown example for deployment to AKS you can check these examples. You might notice that the previous code enables SSL with a “leaf domain label”. This will be used as a prefix for a certificate generated and maintained by Microsoft. More on this later. But what did that code do to our cluster? Let’s have a look:

$ kubectl get deploy
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
azureml-ba   1/1     1            1           5m
azureml-fe   1/1     1            1           5m
As you can see, this created two deployments, each with one pod. I am not too sure what those two do, but the “fe” deployment creates the frontend that will catch our inferencing calls. We can verify that by looking at the created services, where you can see that both HTTP and HTTPS have been enabled:

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
azureml-fe            LoadBalancer                              80:31051/TCP,443:31350/TCP   4d13h
azureml-fe-int-http   ClusterIP                  <none>         9001/TCP                     4d13h
kubernetes            ClusterIP                  <none>         443/TCP                      4d13h

Additionally, there are two daemon sets that will make sure that the pods get access to the Storage Account in the Azure Machine Learning workspace using blobfuse:

$ kubectl get ds
NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
blobfuse-flexvol-installer   2         2         2       2            2           <none>          154m
volume-monitor               2         2         2       2            2           <none>          85s

Other than a Config Map with the configuration, there is not much more to it. Let’s go and deploy our first model (you will need to load a model, refer to the full example for more details):

from azureml.core.model import Model
from azureml.core.webservice import AksEndpoint
namespace_name = "test"   # This will be the k8s namespace
endpoint_name = "k8sendpoint01"
version_name = "v01"      # Minimum 3 characters
endpoint_deployment_config = AksEndpoint.deploy_configuration(
    tags = {'modelVersion':'0.1', 'department':'finance'}, description = "my first version",
    namespace = namespace_name, version_name = version_name, traffic_percentile = 40)
# model and inference_config come from the full example referenced above
endpoint = Model.deploy(ws, endpoint_name, [model],
                        inference_config, endpoint_deployment_config, aks_target)
endpoint.wait_for_deployment(show_output = True)

Alright, there is a lot to unpack there. First of all, we are creating two things in one go, which can be confusing: we are creating an “endpoint”, and a “version” for that endpoint. An endpoint is essentially a URL, behind which you can have multiple versions of a model running simultaneously. The parameter “traffic_percentile” will determine how much traffic each version receives.

About that: what happens if you only have one version, which gets 40% of the traffic, as in the example above? What happens with the other 60%? It is sent to the “default” version, which happens to be the first version you deploy to an endpoint. In other words, the “traffic_percentile” attribute is pretty useless in our example, but it is important to realize that Azure Machine Learning has this traffic splitting functionality.
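To make the semantics concrete, here is a toy Python sketch of how a frontend could split traffic between versions. The real routing logic lives inside the azureml-fe pods and is not public; the function, names, and numbers below are purely illustrative:

```python
import random

# Toy sketch of the traffic-splitting semantics: any traffic not covered by
# an explicit traffic_percentile falls through to the default version.
def pick_version(versions, default_version):
    """versions maps version name -> traffic_percentile (0-100)."""
    r = random.uniform(0, 100)
    cumulative = 0.0
    for name, percentile in versions.items():
        cumulative += percentile
        if r < cumulative:
            return name
    return default_version

counts = {"v01": 0, "v-default": 0}
for _ in range(10_000):
    counts[pick_version({"v01": 40}, "v-default")] += 1
print(counts)  # roughly 40% to v01, the remaining 60% to the default version
```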

Another interesting variable is namespace_name, which designates the Kubernetes namespace in our cluster that will contain the deployed resources. Let’s have a look there:

$ kubectl -n test get deploy
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
v01    1/1     1            1           11m
$ kubectl -n test get svc
NAME   TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
v01    NodePort                <none>        80:30314/TCP   12m
$ kubectl -n test get pod
NAME                   READY   STATUS    RESTARTS   AGE
v01-6c68d8957b-mh6cz   1/1     Running   0          12m

As you can see, it is a pretty standard configuration: a deployment with a service. What is strange here is what you actually do not see: the service is of type “NodePort” (it has no public IP of its own), and there is no ingress controller or ingress. This is because the Azure Machine Learning frontend (that “fe” deployment we saw earlier) is in charge of distributing load across the pods in a version (and across versions, according to the “traffic_percentile” of each version).

You can have a look at some of the most important properties of your newly deployed endpoint and version with this simple code:

from azureml.core.webservice.aks import AksEndpoint
version = version_name  # the version we deployed earlier ("v01")
print("Endpoint info:")
print("* Endpoint name:", endpoint.name)
print("* Auth enabled:", endpoint.auth_enabled)
print("* Compute type:", endpoint.compute_type)
print("* Scoring URI:", endpoint.scoring_uri)
print("Version info:")
print("* Auth enabled:", endpoint.versions[version].auth_enabled)
print("* Traffic percentile:", endpoint.versions[version].traffic_percentile)
print("* Created by:", endpoint.versions[version].created_by['userName'])
print("* App Insights enabled:", endpoint.versions[version].enable_app_insights)
print("* Version type:", endpoint.versions[version].version_type)
print("* Model ID:", endpoint.versions[version].models[0].id)
print("* Scoring URI:", endpoint.versions[version].scoring_uri)
print("* Is Default:", endpoint.versions[version].is_default)
print("* State:", endpoint.versions[version].state)
print("* Errors:", endpoint.versions[version].error)
print("* Image:", endpoint.versions[version].image)
print("* Concurrent requests per container:", endpoint.versions[version].max_concurrent_requests_per_container)
print("* Maximum request wait time:", endpoint.versions[version].max_request_wait_time)
print("* Scoring timeout (ms):", endpoint.versions[version].scoring_timeout_ms)
print("* Replicas:", endpoint.versions[version].num_replicas)
print("* CPU cores:", endpoint.versions[version].cpu_cores)
print("* Memory (GB):", endpoint.versions[version].memory_gb)
autoscaler = endpoint.versions[version].autoscaler
print("* Autoscale enabled:", autoscaler.autoscale_enabled)
print("* Autoscale max replicas:", autoscaler.max_replicas)
print("* Autoscale min replicas:", autoscaler.min_replicas)

Which will generate this output, describing some of the most important attributes of your endpoint and version. Note how there are two scoring URIs: one for the endpoint (where the traffic is distributed across all the versions in the endpoint), and URIs specific to each version, both with the prefix that we configured when attaching the AKS cluster to the AMLS workspace. As you can tell from the domain, in the background AMLS is using Azure App Service managed certificates:

Endpoint info: 
* Endpoint name: diyendpoint1 
* Auth enabled: False 
* Compute type: AKSENDPOINT 
* Scoring URI:
Version info: 
* Auth enabled: False 
* Traffic percentile: 40.0 
* Created by: Jose Moreno 
* App Insights enabled: False 
* Version type: Control 
* Model ID: sklearn_regression_model.pkl:1 
* Scoring URI: 
* Is Default: True 
* State: Healthy 
* Errors: None 
* Image: None 
* Concurrent requests per container: 1 
* Maximum request wait time: 500 
* Scoring timeout (ms): None 
* Replicas: 1 
* Autoscale enabled: True 
* Autoscale max replicas: 2 
* Autoscale min replicas: 1

You could actually deploy another version to the same endpoint with this Python code:

version_name_add = "v02"  # name for the new version
endpoint.create_version(version_name = version_name_add,
                        tags = {'modelVersion':'2', 'department':'finance'},
                        description = "my second version",
                        traffic_percentile = 10,
                        models = [model],
                        inference_config = inference_config)

This would create a second deployment in the same namespace, very similar to the first one:

$ kubectl -n test get deploy
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
v01    1/1     1            1           13m
v02    1/1     1            1           2m

Note that the second version deployed is not the default version, so the “traffic_percentile” is important here: the new version will only get 10% of the inferencing requests sent to the endpoint.

This configuration is very useful if you want to test how a model behaves in production: redirect a small amount of traffic to it, and only complete the migration when you are satisfied with its performance and results.
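Once the new version has proven itself, you could shift all traffic to it and make it the new default. A sketch using the SDK’s update_version method (sending 100% of the traffic and flipping is_default are my illustrative choices, not something the deployment above requires):

```python
# Promote v02 once validated; traffic_percentile and is_default are
# parameters of AksEndpoint.update_version
endpoint.update_version(version_name = "v02",
                        traffic_percentile = 100,
                        is_default = True)
endpoint.wait_for_deployment(show_output = True)
```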

You can modify existing attributes of the endpoint or its versions from the Python SDK. For example, to modify the autoscaling settings for one of the deployed versions you can use this Python code:
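A sketch using the endpoint’s update_version method (the replica counts below are illustrative):

```python
# Adjust autoscaling for the version we deployed earlier
endpoint.update_version(version_name = version_name,
                        autoscale_enabled = True,
                        autoscale_min_replicas = 1,
                        autoscale_max_replicas = 4)
endpoint.wait_for_deployment(show_output = True)
```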


If you look into AKS now, you might expect to find a Horizontal Pod Autoscaler (at least I did). However, no HPA is created, since autoscaling versions is another function of those “frontend” pods we saw earlier. The frontend pods monitor how many concurrent requests are coming to each pod in a version, and scale the corresponding deployment in or out if required. If more hardware resources are needed, the Cluster Autoscaler that we configured when creating the cluster will kick in and provision new Kubernetes nodes.

Speaking about resources: from the Python SDK we can control the resources allocated to the inferencing pods. So far we have been using the defaults, but you can change this even after having deployed the version:
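For example, a sketch that requests 0.2 CPU cores and 0.6 GB of memory per pod, again via update_version:

```python
# Request 0.2 CPU cores and 0.6 GB of memory for each pod of the version
endpoint.update_version(version_name = version_name,
                        cpu_cores = 0.2,
                        memory_gb = 0.6)
endpoint.wait_for_deployment(show_output = True)
```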


If you now look at the deployment in Kubernetes, you will see that the corresponding resource requests have been updated:

$ kubectl -n test describe deploy/v01
  Requests:
    cpu:     200m
    memory:  600M

Something else you can do is enable authorization on your endpoint (note that this will affect all versions deployed under that endpoint). The code is fairly easy:
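A sketch using the endpoint’s update method:

```python
# auth_enabled applies to the endpoint as a whole, i.e. to every version
endpoint.update(auth_enabled = True)
endpoint.wait_for_deployment(show_output = True)
```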


After enabling authorization, inferencing requests need to carry an Authorization HTTP header with the format “Bearer <key>”. You can get the two keys configured for the endpoint with the Python method “endpoint.get_keys()”. You only need to supply one key in the Authorization header, but the endpoint has two keys to facilitate key rotation.
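For example, a minimal sketch building such a header (the payload schema is hypothetical and depends on your score.py entry script; a placeholder stands in for the real key you would fetch with endpoint.get_keys()):

```python
import json

# In a real session you would fetch the keys with:
#   primary_key, secondary_key = endpoint.get_keys()
primary_key = "00000000000000000000000000000000"  # placeholder, not a real key
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + primary_key,  # either of the two keys works
}
# Hypothetical payload; the real schema depends on your score.py entry script
payload = json.dumps({"data": [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]})
# The request would then go to the scoring URI, e.g.:
#   requests.post(endpoint.scoring_uri, data=payload, headers=headers)
```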

And that concludes my tour of what is really going on under the covers when you deploy a model from Azure Machine Learning to AKS. As you can see, you can control most of the inferencing deployment from Python without necessarily knowing about Kubernetes (although that would not hurt either), which makes leveraging AKS for your model deployments a walk in the park.

Please let me know your thoughts about it!
