Key AKS Architecture Considerations
Control Plane vs Node Plane
AKS abstracts the Kubernetes control plane, but it's important to understand that issues like API throttling, kubelet failures, or container runtime (containerd) errors occur at the node level, which remains under your control. Observability and access limitations can make troubleshooting non-trivial.
Integration with Azure Infrastructure
AKS integrates tightly with Azure resources such as Azure Load Balancer, Azure Disk, Azure Files, and Managed Identity. Misconfigured resource provisioning or role assignments often lead to cascading failures across deployments.
Common Troubleshooting Scenarios in AKS
1. PVC Stuck in Pending State
This often results from incorrect storage class configuration or quota exhaustion. Check PVC and PV descriptions:
kubectl describe pvc my-volume
Ensure the storage class exists and maps to an available Azure Disk SKU. Watch for errors like:
failed to provision volume with StorageClass "default": disk quota exceeded
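A quick triage sequence, assuming the default managed-disk storage class (region and resource names below are placeholders):

# Confirm the storage class exists and uses the managed-disk provisioner (disk.csi.azure.com)
kubectl get storageclass

# Check which disk SKUs are actually offered in your region
az vm list-skus --location eastus --resource-type disks --output table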
2. Node Not Ready / Pod Eviction Loops
Commonly caused by:
- Node OS disk pressure
- CoreDNS crash loops
- DaemonSet misconfiguration (e.g., conflicting Calico versions)
Check node status and kubelet logs:
kubectl describe node aks-nodepool-12345-0
journalctl -u kubelet
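To narrow down which of the causes above applies, a short check sequence (the node name is a placeholder):

# Print only the node conditions currently reporting True (e.g. DiskPressure)
kubectl get node aks-nodepool-12345-0 -o jsonpath='{.status.conditions[?(@.status=="True")].type}'

# Verify CoreDNS pods are healthy
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Surface recent evictions cluster-wide
kubectl get events -A --field-selector reason=Evicted

Note that journalctl -u kubelet runs on the node itself, e.g. from a kubectl debug node session (see the debugging techniques below).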
3. LoadBalancer IP Not Provisioned
External IP may not bind due to subnet exhaustion or invalid NSG rules. Validate AKS subnet IP range and service manifest:
kubectl get svc my-service -o wide
Also verify with Azure CLI:
az network public-ip list --query "[?ipAddress=='']"
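If the external IP stays pending, the service's events usually contain the exact provisioning error, and subnet capacity is worth confirming (the VNet and subnet names here are assumptions):

# Provisioning errors appear under Events
kubectl describe svc my-service

# Check the AKS subnet's address space
az network vnet subnet show --resource-group my-rg --vnet-name my-vnet --name aks-subnet --query addressPrefix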
4. Autoscaler Not Scaling Down
Cluster Autoscaler may leave unused nodes if pods that cannot be evicted (restrictive affinity, taints, or PDB constraints) are running on them. For a self-managed autoscaler, review its logs directly:
kubectl logs -n kube-system deployment/cluster-autoscaler
On AKS the managed autoscaler runs in the control plane, so its logs surface through the cluster-autoscaler diagnostic log category instead.
Watch for messages like:
pod requires node with unmatchable constraints
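The autoscaler also publishes a status ConfigMap that summarizes scale-down blockers per node group, and PDBs with zero allowed disruptions are a frequent culprit:

# Autoscaler's own view of why nodes are being kept
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

# PDBs whose allowed disruptions show 0 will block scale-in
kubectl get pdb -A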
5. Intermittent API Throttling
Frequent deployments or Helm releases can trigger API rate limits. Azure Resource Manager (ARM) throttles requests beyond subscription quotas. Monitor with Azure Monitor or enable diagnostic settings on the AKS resource.
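Throttled (HTTP 429) and otherwise failed ARM calls show up in the activity log; a quick check, with the resource group name as a placeholder:

# Recent failed ARM operations against the cluster's resource group
az monitor activity-log list --resource-group my-rg --status Failed --offset 1d --output table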
Diagnostics and Debugging Techniques
1. Enable AKS Diagnostics Logs
Use Azure Monitor and Log Analytics to collect logs from kubelet, container logs, and control plane events. Configure via:
az aks enable-addons --addons monitoring --resource-group my-rg --name my-aks
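Control plane logs (API server, scheduler, autoscaler) are enabled separately through a diagnostic setting; a minimal sketch, assuming an existing Log Analytics workspace (IDs are placeholders):

az monitor diagnostic-settings create \
  --name aks-control-plane-logs \
  --resource $(az aks show --resource-group my-rg --name my-aks --query id -o tsv) \
  --workspace <log-analytics-workspace-id> \
  --logs '[{"category":"kube-apiserver","enabled":true},{"category":"cluster-autoscaler","enabled":true}]'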
2. Use kubectl debug for Live Pod Inspection
Attach ephemeral debug containers to troubleshoot running pods:
kubectl debug pod-name -it --image=busybox --target=main-container
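The same command can also open a shell on a worker node: it creates a privileged pod on the node with the host filesystem mounted under /host (the node name is a placeholder; delete the generated debug pod when finished):

kubectl debug node/aks-nodepool-12345-0 -it --image=busybox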
3. Analyze Node Metrics with Container Insights
Use Azure Monitor's Container Insights to visualize CPU/memory usage, disk pressure, and pod restart trends across node pools.
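To confirm the Container Insights agent is actually enabled (resource names are placeholders):

# Addon status on the AKS resource
az aks show --resource-group my-rg --name my-aks --query addonProfiles.omsagent.enabled

# Quick in-cluster sanity check via metrics-server
kubectl top nodes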
4. Event Stream Analysis
Watch real-time Kubernetes events to correlate symptoms with resource issues:
kubectl get events --sort-by=.metadata.creationTimestamp
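Filtering to warnings across all namespaces cuts the noise considerably:

kubectl get events -A --field-selector type=Warning --sort-by=.metadata.creationTimestamp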
Advanced Pitfalls and Remedies
- Misaligned AKS Versioning: Auto-upgrades can introduce breaking changes; always validate in staging clusters before enabling auto-updates.
- Insufficient UAMI Permissions: User-assigned managed identity failures block storage or ingress provisioning; verify role assignments using az role assignment list.
- Network Plugin Drift: Mismatched Calico/CNI versions during upgrades may cause pod networking failures; explicitly version network add-ons.
- Zombie Azure Resources: Failed provisioning can leave orphaned IPs, NICs, and disks; regularly audit resource groups via tagging and lifecycle rules (a quick audit sketch follows below).
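A minimal audit sketch for the two most common orphan types; the filters are assumptions about what counts as orphaned in your environment:

# Managed disks not attached to any VM
az disk list --query "[?diskState=='Unattached'].{name:name, rg:resourceGroup}" --output table

# Public IPs with no IP configuration bound to them
az network public-ip list --query "[?ipConfiguration==null].{name:name, rg:resourceGroup}" --output table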
Best Practices for Stable AKS Deployments
- Always use node pools with maxSurge during upgrades
- Use pod disruption budgets (PDBs) to control eviction during scaling/updates (both are sketched after this list)
- Isolate critical workloads in dedicated node pools with taints/tolerations
- Tag Azure resources for automated cleanup and audit trails
- Set up alerts for node readiness, pod crash loops, and PVC status via Azure Monitor
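A minimal sketch of both practices, assuming a node pool named nodepool1 and a hypothetical app label (adjust names to your cluster):

# Allow one third extra surge capacity during node pool upgrades
az aks nodepool update --resource-group my-rg --cluster-name my-aks --name nodepool1 --max-surge 33%

# PDB keeping at least two replicas of a critical app available during evictions
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
EOF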
Conclusion
AKS simplifies Kubernetes adoption but introduces its own set of production-grade complexities. Misconfiguration of storage classes, autoscalers, network policies, and Azure integrations can result in hard-to-diagnose issues. By deeply understanding AKS internals, leveraging Azure-native observability, and applying proactive architectural practices, you can build resilient and scalable Kubernetes environments in Azure.
FAQs
1. Why are my PVCs stuck in Pending even though the storage class is defined?
The backing Azure Disk SKU may be unavailable in your region or quota exhausted. Check quota usage and region SKU availability via Azure CLI.
2. How do I debug autoscaler not scaling down unused nodes?
Review cluster-autoscaler logs and check for unschedulable pods with affinity, anti-affinity, or PDB constraints that prevent scale-in.
3. What causes LoadBalancer IP provisioning delays?
Subnet exhaustion or NSG misconfigurations block IP assignments. Ensure sufficient IP space and allow rules for required ports.
4. Can I SSH into AKS worker nodes?
Not via a dedicated az aks command. The documented approaches are kubectl debug node/<node-name> -it --image=busybox, which starts a privileged pod with the node's filesystem mounted, or running commands against the node pool's scale set with az vmss run-command invoke.
5. What is the recommended way to monitor AKS clusters?
Enable Azure Monitor with Container Insights and diagnostic logs to capture node metrics, control plane events, and application logs.