It is possible to create an AKS cluster whose node pools span more than one availability zone. This ensures that the nodes of the cluster are physically separated in different zones within the same region, which adds redundancy: if one zone goes down, the cluster keeps working. The documentation mentions some limitations (not all regions support availability zones, and the node pool's VM size must be available in every zone you select).
Also, you can only set the availability zones of a node pool when you create it; once created, they cannot be modified.
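For reference, the zones are chosen at creation time through the --zones parameter of az aks nodepool add (the pool name and node count here are just an example):

az aks nodepool add --cluster-name <aks-name> --resource-group <aks-resource-group> --name zonedpool --node-count 3 --zones 1 2 3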
On paper that's great (anything that adds fault tolerance is welcome), but beware: it comes with a price to pay, namely that a node can only attach disks that live in its own availability zone. That means that if a node is in availability zone 1, all of its pods will only be able to attach disks that are in that same zone. The nodes carry the labels topology.kubernetes.io/region and failure-domain.beta.kubernetes.io/region indicating the region, and the labels topology.kubernetes.io/zone and failure-domain.beta.kubernetes.io/zone indicating the availability zone (there are two labels for the same thing because the failure-domain labels are deprecated, but they are kept for compatibility).
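You can check which zone each node ended up in by showing that label as an extra column (node names will differ in your cluster; the ZONE column shows values such as westeurope-1 or westeurope-2):

> kubectl get nodes -L topology.kubernetes.io/zone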
Availability zones and dynamically provisioned PVs
What happens when a PV is dynamically provisioned from a PVC? Remember that in this case the PV is created automatically to satisfy the PVC. To check what happens, I created 4 PVCs and waited for the 4 underlying PVs to be created:
> kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-a9d08554-7557-46dd-a2b6-5c269ba7a688 5Gi RWO Delete Bound default/pvc3 managed-premium 84s
pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea 5Gi RWO Delete Bound default/pvc1 default 84s
pvc-afd7dc31-f633-4c99-8b7e-1f59f58f4416 3Gi RWO Delete Bound default/pvc4 managed-premium 84s
pvc-5e147a8a-5e1f-49c5-9a02-51ebe9ee2807 8Gi RWO Delete Bound default/pvc2 default 84s
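I haven't shown the PVC manifests, but they are plain PVCs along these lines (this one corresponds to pvc1; the others only differ in name, size and storage class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: default
  resources:
    requests:
      storage: 5Gi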
In my MC_* resource group the disks have already been created. But in which availability zone?
$aks = "<aks-name>"
$rg = "<aks-resource-group>"
$rgmc = (az aks show -n $aks -g $rg -o json | ConvertFrom-Json).nodeResourceGroup
az disk list -g $rgmc --query '[][name,zones[0]]'
The output in my case is:
[
    [
        "kubernetes-dynamic-pvc-a9d08554-7557-46dd-a2b6-5c269ba7a688",
        "2"
    ],
    [
        "kubernetes-dynamic-pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea",
        "1"
    ],
    [
        "kubernetes-dynamic-pvc-afd7dc31-f633-4c99-8b7e-1f59f58f4416",
        "2"
    ],
    [
        "kubernetes-dynamic-pvc-5e147a8a-5e1f-49c5-9a02-51ebe9ee2807",
        "1"
    ]
]
In this case, two of the disks were created in availability zone 1 and the other two in availability zone 2. The disk's availability zone is also reflected in the associated PV, which declares a nodeAffinity to ensure that pods using that PV run on a node in that same zone:
> kubectl describe pv pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
Name: pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
Labels: failure-domain.beta.kubernetes.io/region=westeurope
failure-domain.beta.kubernetes.io/zone=westeurope-1
Annotations: pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
Finalizers: [kubernetes.io/pv-protection]
StorageClass: default
Status: Bound
Claim: default/pvc1
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 5Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/region in [westeurope]
failure-domain.beta.kubernetes.io/zone in [westeurope-1]
Message:
Source:
Type: AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
DiskName: kubernetes-dynamic-pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
DiskURI: /subscriptions/ac529d82-e944-485d-9a9b-4d46d44214da/resourceGroups/mc_cloudiseasy_test_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
Kind: Managed
FSType:
CachingMode: ReadOnly
ReadOnly: false
Events: <none>
Notice how the Node Affinity declares that this PV can only be used on a node whose failure-domain.beta.kubernetes.io/zone label has the value westeurope-1 (availability zone 1, which is the zone of the disk associated with this PV).
If I now create a pod that uses that PV (through the associated PVC, pvc1 in the example above), the Kubernetes scheduler will make sure the pod runs on a node in the same availability zone. The result is that the pod is assigned to the node aks-agentpool-65616547-vmss000000, which is the one in availability zone 1:
> kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginxpvc1-5845934b55-megtq 1/1 Running 0 112s 10.242.0.12 aks-agentpool-65616547-vmss000000 <none> <none>
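For reference, deploy-pvc1.yaml (reused below) looks roughly like the following; the nginx image and mount path are just illustrative, the relevant part is the volume backed by pvc1:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginxpvc1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginxpvc1
  template:
    metadata:
      labels:
        app: nginxpvc1
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc1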
What if, for whatever reason, that node could not run the pod? Very simple: it would stay Pending. To prove it, I delete the deployment, taint the node so that nothing new can be scheduled on it, and recreate the deployment:
> kubectl delete deploy nginxpvc1
deployment.apps "nginxpvc1" deleted
> kubectl taint node aks-agentpool-65616547-vmss000000 donothing:NoSchedule
node/aks-agentpool-65616547-vmss000000 tainted
> kubectl apply -f .\deploy-pvc1.yaml
deployment.apps/nginxpvc1 created
> kubectl get po
NAME READY STATUS RESTARTS AGE
nginxpvc1-57458397b55-dhrkd 0/1 Pending 0 58s
And if you run kubectl describe on the pod, the Events section makes it clear that the pod cannot run because the scheduler cannot find any suitable node.
Availability zones and static PVs
If you create a PV linked to a pre-existing disk, you must add a nodeAffinity to that PV; if you don't, the scheduler may place the pod on any node, and if that node is in a different availability zone the disk cannot be attached. Let's look at an example. I have a disk called testvhd created in availability zone 3, and I define a PV associated with that disk.
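A sketch of that PV, plus the PVC it is reserved for, without any nodeAffinity yet (the subscription id in diskURI is a placeholder; the node resource group, capacity, storage class and reclaim policy match the output shown below):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: testvhd
spec:
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium
  claimRef:
    name: testvhd
    namespace: default
  azureDisk:
    kind: Managed
    diskName: testvhd
    diskURI: /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/MC_velerotest_velerotest_westeurope/providers/Microsoft.Compute/disks/testvhd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testvhd
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-premium
  volumeName: testvhd
  resources:
    requests:
      storage: 8Gi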
This PV is linked to the disk indicated in diskURI and reserved for the PVC indicated in claimRef.name. The PVC itself must be created too, of course. Once the PVC is applied, the PV becomes Bound:
> kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
testvhd 8Gi RWO Retain Bound default/testvhd managed-premium 108s
Now we can deploy a pod that uses the PVC testvhd and see what happens. Well, two things can happen:
- The pod is deployed, by pure chance, on a node that is in the same availability zone as the disk (zone 3 in my case). In this case the pod runs without problems. (By default, the scheduler tries to spread pods across the different availability zones, so both outcomes are plausible.)
- The pod is deployed, by pure chance, on a node that is in a different availability zone from the disk. In this case the pod will stay in ContainerCreating, and in the Events section you will see something like:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 72s default-scheduler Successfully assigned default/nginxpvctestvhd-5575bcddcb-bbjrh to aks-agentpool-75116936-vmss000001
Warning FailedAttachVolume 72s attachdetach-controller Multi-Attach error for volume "testvhd" Volume is already used by pod(s) nginxpvctestvhd-5575bcddcb-tgccb
Warning FailedAttachVolume 15s (x7 over 49s) attachdetach-controller AttachVolume.Attach failed for volume "testvhd" : Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
"error": {
"code": "BadRequest",
"message": "Disk /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/MC_velerotest_velerotest_westeurope/providers/Microsoft.Compute/disks/testvhd cannot be attached to the VM because it is not in the same zone as the VM. VM zone: '2'. Disk zone: '3'."
}
}
That's because, unlike dynamically provisioned PVs, our PV has no node affinity defined, so the scheduler does not know that only nodes in availability zone 3 can attach the disk. Hence you must add a nodeAffinity to the PV definition. This way, when the scheduler needs to place a pod that uses this PV, it will put it on a node in availability zone 3 and everything will work correctly.
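As a sketch, under the PV's spec it would look something like this (using the same beta zone label that the dynamically provisioned PV carried; on newer clusters the topology.kubernetes.io/zone label plays the same role):

  # added under spec: of the testvhd PV defined above
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - westeurope-3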
Conclusion
Using availability zones increases the fault tolerance of your AKS cluster, but it adds an extra dimension to your cluster's topology: once a PV is placed in an availability zone, any pod that uses it must be scheduled in that same zone. Therefore, you must also balance the zones well. For example, having only a single node per zone may not be a good idea, because if that node goes down, its pods that use zoned disks cannot be relocated to any other node.