Understanding AKS Architecture and Operational Layers

Managed Control Plane vs Customer Node Pools

AKS abstracts away the Kubernetes control plane, which is fully managed and hidden from the customer. However, node pools, system components (e.g., CoreDNS, kube-proxy), and integrations (e.g., Azure AD, CSI drivers) require user-side configuration and monitoring.

Key AKS Integrations That Commonly Fail

  • Azure CNI or Kubenet (networking)
  • CSI drivers (disk, file, blob)
  • AAD pod identity and workload identities
  • Azure Monitor and Log Analytics

Misconfigurations at these integration points frequently cause cluster-level degradation or pod-level failures.
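
A quick way to see which integrations are enabled on a cluster and whether their system pods are healthy is sketched below; RG and CLUSTER are placeholders for your resource group and cluster name.

az aks show --resource-group RG --name CLUSTER --query addonProfiles -o json
kubectl get pods -n kube-system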

Common AKS Troubleshooting Scenarios

1. Pods Stuck in Pending or ImagePullBackOff

Often due to:

  • No available IPs in the subnet (Azure CNI)
  • Wrong nodeSelector or taints
  • Unreachable or unauthenticated ACR

kubectl describe pod my-app

Check events like FailedScheduling or ErrImagePull. Ensure node pools have capacity and that ACR integration is in place, either by attaching the registry to the cluster (az aks update --attach-acr) or by granting the kubelet's managed identity the AcrPull role.
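
If Azure CNI IP exhaustion or registry access is suspected, the checks below are a reasonable starting point; RG, VNET, SUBNET, CLUSTER, and MYREGISTRY are placeholders, and the IP count query assumes the subnet already hosts node IP configurations.

# Count how many IP configurations the node subnet is already using
az network vnet subnet show --resource-group RG --vnet-name VNET --name SUBNET --query "length(ipConfigurations)" -o tsv

# Validate that the cluster's kubelet identity can pull from the registry (recent CLI versions)
az aks check-acr --resource-group RG --name CLUSTER --acr MYREGISTRY.azurecr.io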

2. Persistent Volume Claim (PVC) Failures

Issues arise from:

  • Unbound volume due to wrong storage class
  • CSI driver pods not running
  • Availability zone mismatch

kubectl get events | grep pvc

Use kubectl describe pvc and check that the Azure Disk and Azure File CSI driver pods are healthy in the kube-system namespace.
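
A quick health pass over the storage layer might look like the following; the label selectors reflect current AKS CSI DaemonSet defaults and may differ on older clusters, and my-claim is a placeholder PVC name.

kubectl get pods -n kube-system -l app=csi-azuredisk-node
kubectl get pods -n kube-system -l app=csi-azurefile-node
kubectl get storageclass
kubectl describe pvc my-claim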

3. Node Pool Drift and Version Skew

Over time, user node pools may diverge from the control plane version, leading to unexpected behavior, feature mismatch, or unsupported configurations.

az aks nodepool list --resource-group RG --cluster-name CLUSTER

Update with:

az aks nodepool upgrade --resource-group RG --cluster-name CLUSTER --name NODEPOOL --kubernetes-version X.Y.Z
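
To spot skew at a glance, compare the control plane version with each pool's orchestrator version; the field names below follow the current az CLI output.

az aks show --resource-group RG --name CLUSTER --query kubernetesVersion -o tsv
az aks nodepool list --resource-group RG --cluster-name CLUSTER --query "[].{pool:name, version:orchestratorVersion}" -o table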

4. CoreDNS or KubeProxy CrashLooping

Caused by:

  • Misconfigured DNS IP (e.g., conflict with on-prem DNS)
  • Outdated or corrupted CoreDNS config
  • Custom CNI or kube-proxy tuning errors

kubectl -n kube-system logs -l k8s-app=kube-dns

Ensure the coredns and coredns-custom ConfigMaps and the cluster DNS service IP (e.g., 10.0.0.10) align with the Azure CNI or Kubenet CIDR setup.
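
One way to cross-check the DNS plumbing, assuming default AKS object names (coredns-custom holds any user overrides):

kubectl get svc kube-dns -n kube-system
kubectl get configmap coredns coredns-custom -n kube-system -o yaml
az aks show --resource-group RG --name CLUSTER --query networkProfile.dnsServiceIp -o tsv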

5. AAD Pod Identity or Workload Identity Failures

Symptoms include 403s from Azure SDKs or metadata service errors. Causes:

  • MIC or NMI pods not running
  • Binding mismatch between identity and pod labels
  • Transition from AAD Pod Identity to Workload Identity incomplete

kubectl logs -n kube-system -l component=nmi

Validate role assignments, pod labels, and CRDs for AzureIdentityBinding.
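
A typical validation pass for AAD Pod Identity looks roughly like this; the aadpodidbinding label and CRD names are the add-on's defaults, while my-namespace and CLIENT_ID are placeholders.

kubectl get azureidentity,azureidentitybinding -A
kubectl get pods -n my-namespace -L aadpodidbinding
az role assignment list --assignee CLIENT_ID -o table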

Diagnostic Strategies and Tooling

1. Azure Resource Health and Activity Logs

Use the Resource Health blade in the Azure Portal and the subscription activity log to check for AKS control plane or VMSS-level issues.
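
For example, recent operations against the cluster's resource group can be pulled with the CLI; the one-day offset and output fields are just a starting point.

az monitor activity-log list --resource-group RG --offset 1d --query "[].{time:eventTimestamp, op:operationName.value, status:status.value}" -o table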

2. Kubelet and Container Logs

First identify the relevant node pools and nodes:

az aks nodepool list --resource-group RG --cluster-name CLUSTER --query "[].{name:name,mode:mode}" -o table

Then SSH to the node or use kubectl debug node/NODENAME to inspect kubelet logs.
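
A node debug session might look like the sketch below; any small Linux image with a shell works, and journalctl is available once you chroot into the host filesystem on Ubuntu or Mariner nodes.

kubectl debug node/NODENAME -it --image=ubuntu
# inside the debug pod, the node's filesystem is mounted at /host
chroot /host journalctl -u kubelet --no-pager | tail -n 100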

3. Network Tracing with Netshoot or tcpdump

Run ephemeral debug pods to inspect DNS, IP routes, and service resolution.

kubectl run netshoot --rm -it --image=nicolaka/netshoot -- bash
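
Inside the pod, a few quick probes usually narrow things down; the service and namespace names below are placeholders, and 10.0.0.10 is the example cluster DNS IP from earlier.

nslookup kubernetes.default.svc.cluster.local
dig @10.0.0.10 my-service.my-namespace.svc.cluster.local
ip route
curl -v http://my-service.my-namespace.svc.cluster.local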

4. Azure Monitor and Container Insights

Check metrics for CPU pressure, OOMKills, and failed mounts. Use kubectl top or query the Log Analytics workspace.
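
For quick checks from the CLI, something like the following is a reasonable starting point; WORKSPACE_ID is a placeholder, and KubePodInventory is a standard Container Insights table.

kubectl top nodes
kubectl top pods -A --sort-by=memory
az monitor log-analytics query --workspace WORKSPACE_ID --analytics-query "KubePodInventory | where ContainerStatus != 'running' | take 20" -o table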

Fixes and Best Practices

1. Networking

  • For Azure CNI, allocate adequate IPs per node subnet (a quick sizing check is sketched after this list)
  • Enable VNet peering properly if integrating hybrid networks
  • Use Calico network policies only when needed; avoid default-deny policies that block required traffic
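
As a rough rule for Azure CNI, each node consumes one subnet IP plus one per pod up to maxPods, so the subnet must cover (nodes + surge nodes) x (maxPods + 1). A minimal check with placeholder names:

az aks nodepool list --resource-group RG --cluster-name CLUSTER --query "[].{pool:name, maxPods:maxPods, count:count}" -o table
az network vnet subnet show --resource-group RG --vnet-name VNET --name SUBNET --query addressPrefix -o tsv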

2. Identity Management

  • Migrate from AAD Pod Identity to Workload Identity (GA); a minimal enablement sketch follows this list
  • Use managed identity bindings per namespace/app
  • Automate role assignments with Terraform or Bicep
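
A minimal sketch of enabling workload identity and wiring up a service account; RG, CLUSTER, MY_IDENTITY, my-namespace, my-serviceaccount, and CLIENT_ID are placeholders, and the federated credential's audience defaults to api://AzureADTokenExchange.

az aks update --resource-group RG --name CLUSTER --enable-oidc-issuer --enable-workload-identity
export OIDC_ISSUER=$(az aks show --resource-group RG --name CLUSTER --query oidcIssuerProfile.issuerUrl -o tsv)
az identity federated-credential create --name my-app-fic --identity-name MY_IDENTITY --resource-group RG --issuer "$OIDC_ISSUER" --subject system:serviceaccount:my-namespace:my-serviceaccount
kubectl annotate serviceaccount my-serviceaccount -n my-namespace azure.workload.identity/client-id=CLIENT_ID
# pods that use the identity also need the label azure.workload.identity/use: "true"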

3. Storage

  • Use zonal disks for availability
  • Keep CSI drivers current by upgrading the cluster and node images (AKS manages the driver versions)
  • Use a Premium SSD storage class for databases rather than the default class (see the check below)
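
To confirm which classes and provisioners a cluster offers (class names below reflect current AKS defaults):

kubectl get storageclass
# look for a Premium_LRS-backed class such as managed-csi-premium for database workloads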

4. Cluster Maintenance

  • Use auto-upgrade for node pools (with caution; an example follows this list)
  • Regularly audit node taints, labels, and capacity
  • Monitor drift using Azure Policy and Defender for Containers
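
Auto-upgrade is configured per cluster through an upgrade channel; for example (supported channel values include none, patch, stable, rapid, and node-image):

az aks update --resource-group RG --name CLUSTER --auto-upgrade-channel stable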

Conclusion

AKS enables scalable and secure Kubernetes workloads on Azure, but its enterprise-grade usage exposes architectural and operational complexity. Misconfigured integrations, networking constraints, identity mismatches, and version drift can silently undermine cluster stability. By applying structured diagnostics, using the right tooling, and implementing architectural safeguards, teams can ensure reliable and efficient operation of AKS clusters in production.

FAQs

1. Why are my AKS pods stuck in Pending?

Often due to subnet IP exhaustion, taints preventing scheduling, or unbound PVCs. Use kubectl describe pod and kubectl get events.

2. How can I check which AKS node pool is outdated?

Run az aks nodepool list and compare Kubernetes versions against the cluster version. Upgrade with az aks nodepool upgrade.

3. What causes persistent volume mount failures in AKS?

Common causes include missing CSI driver pods, zone mismatch between pod and volume, or unsupported storage class settings.

4. How do I migrate from AAD Pod Identity to Workload Identity?

Disable the pod identity add-on, enable the OIDC issuer and workload identity features on the cluster, create federated identity credentials for the managed identity, annotate and label the workload's service accounts and pods, and reassign roles accordingly.

5. Is it safe to enable auto-upgrade for AKS node pools?

It depends on workload sensitivity. Auto-upgrades reduce maintenance burden but may cause disruption without proper pod disruption budgets (PDBs).