Understanding Kubernetes Architecture

Control Plane vs Node-Level Responsibilities

Kubernetes separates orchestration into control plane components (API server, scheduler, controller manager, etcd) and node-level services (kubelet, kube-proxy, container runtime). Many issues arise from incorrect assumptions about where the boundary between these two layers lies and how they interact.
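
On clusters where the control plane itself runs as pods (kubeadm-style installs; managed services hide these components), a quick inventory of both layers is:

kubectl get pods -n kube-system -o wide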

Key Infrastructure Dependencies

  • Cloud provider APIs (for storage, networking, autoscaling)
  • Container runtimes (e.g., containerd, CRI-O)
  • CNI plugins for networking (Calico, Flannel, Cilium)

Common and Complex Troubleshooting Scenarios

1. Pods in Pending or CrashLoopBackOff State

These are surface symptoms. Root causes include insufficient node resources, unbound PVCs, image pull failures, and misconfigured readiness/liveness probes.

kubectl describe pod my-app-pod
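
If describe points to a crashing container rather than a scheduling problem, the previous container's logs usually hold the actual error (my-app-pod is a placeholder name):

kubectl logs my-app-pod --previous
kubectl get pod my-app-pod -o jsonpath='{.status.containerStatuses[*].state}'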

2. Persistent Volume Binding Failures

PVCs may remain unbound due to zone mismatches, unavailable storage classes, or misconfigured reclaim policies, a failure mode that is especially common in multi-zone clusters.
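
To see why a claim is stuck, start with kubectl describe pvc and kubectl get storageclass. In multi-zone clusters, a topology-aware StorageClass avoids most zone mismatches; the sketch below uses a hypothetical name and an example CSI provisioner, not values from this article:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ssd             # placeholder name
provisioner: ebs.csi.aws.com           # example driver; substitute your provider's CSI driver
volumeBindingMode: WaitForFirstConsumer # bind only once a pod is scheduled, in that pod's zone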

3. Node Not Ready / DiskPressure / MemoryPressure

Node conditions degrade when resource usage crosses the kubelet's eviction thresholds. The kubelet reports the condition and the control plane applies a matching taint (for example node.kubernetes.io/disk-pressure) so no further pods are scheduled there. Monitor kubectl get nodes and node events.
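
To confirm which condition tripped and when, inspect the node's Conditions block and recent node events (NODE_NAME is a placeholder):

kubectl describe node NODE_NAME | grep -A 8 Conditions
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp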

4. API Server Latency or Unavailability

High etcd latency, API throttling, or network partitioning can cause kubectl hangs and autoscaler failures. Use metrics like apiserver_request_duration_seconds to identify root causes.
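
The API server exposes these metrics directly; a rough first look (run from a workstation with sufficient RBAC permissions) is:

kubectl get --raw /metrics | grep apiserver_request_duration_seconds | head -n 20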

5. Container Runtime Failures

Misconfigurations or version mismatches between the kubelet and containerd may prevent pods from starting at all. Use journalctl -u containerd and crictl ps for deep debugging.
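
On the affected node (assuming containerd as the runtime and a systemd-managed kubelet), these commands surface most runtime-level failures; CONTAINER_ID is a placeholder:

journalctl -u containerd --since "15 min ago"
journalctl -u kubelet --since "15 min ago"
crictl ps -a
crictl logs CONTAINER_ID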

Diagnostic Techniques and Tools

Using Events and Describe

Always start with kubectl describe pod or kubectl get events --sort-by=.metadata.creationTimestamp to correlate scheduling or readiness failures.
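
Filtering to Warning events cuts the noise considerably:

kubectl get events --field-selector type=Warning --sort-by=.metadata.creationTimestamp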

Check Node and Daemon Health

Use:

kubectl get nodes -o wide
kubectl describe node NODE_NAME
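
On the node itself (assuming a systemd-managed kubelet), confirm the daemon is healthy:

systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager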

Audit API Server Logs and etcd

On control plane nodes, inspect:

journalctl -u kube-apiserver
journalctl -u etcd
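
On kubeadm-style clusters the API server and etcd run as static pods rather than systemd units, so their logs are read through kubectl instead (pod names follow the kubeadm convention of component-NODE_NAME):

kubectl logs -n kube-system kube-apiserver-NODE_NAME
kubectl logs -n kube-system etcd-NODE_NAME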

Network Policy and DNS Failures

Use throwaway test pods such as busybox with nslookup and wget (or a curl-equipped debug image) to trace inter-pod communication issues. NetworkPolicy misalignment is a common culprit in enterprise clusters.
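
A throwaway pod is usually enough to separate DNS problems from policy problems; the image tag and my-service.my-namespace are placeholders:

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- wget -qO- http://my-service.my-namespace.svc.cluster.local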

Custom Controller/Operator Issues

Check CRDs, controller logs, and watch loop stability. Improper reconciler loops may cause infinite retries or resource contention.
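
Typical starting points, with my-operator and its namespace as placeholder names:

kubectl get crds
kubectl logs -n my-operator-system deployment/my-operator --tail=200
kubectl get events -n my-operator-system --sort-by=.metadata.creationTimestamp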

Best Practices for Operational Resilience

  • Implement resource requests/limits to prevent node overcommit
  • Enable PodDisruptionBudgets to maintain availability during upgrades (a minimal sketch of both appears after this list)
  • Use taints/tolerations to isolate system workloads
  • Regularly audit cluster-wide RBAC policies for security and stability
  • Adopt observability stack: Prometheus, Grafana, Fluentd/FluentBit, Loki
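
A minimal sketch of the first two practices, using placeholder names, images, and values rather than anything prescribed by this article:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25        # example image
        resources:
          requests:              # what the scheduler reserves
            cpu: 100m
            memory: 128Mi
          limits:                # hard ceiling enforced at runtime
            cpu: 500m
            memory: 256Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: web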

Conclusion

Kubernetes abstracts away many complexities of deployment and scaling, but that abstraction can mask subtle failure modes. Engineers responsible for cluster stability must master both the declarative surface and the dynamic, often hidden, runtime behaviors. By combining kubectl diagnostics, log inspection, and a systems-level view of the control and data planes, teams can resolve even the most obscure Kubernetes issues before they impact production reliability.

FAQs

1. Why do pods get stuck in Terminating state?

Usually a finalizer whose controller never removes it, or a volume that fails to unmount. Use kubectl patch to remove finalizers in emergencies.
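
An emergency last resort, since it skips whatever cleanup the finalizer was guarding (my-app-pod is a placeholder):

kubectl patch pod my-app-pod -p '{"metadata":{"finalizers":null}}' --type=merge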

2. How can I troubleshoot DNS resolution issues inside pods?

Deploy a debug container (e.g., busybox), check /etc/resolv.conf, and use nslookup or dig to verify cluster DNS services.

3. What causes excessive API server throttling?

Overactive controllers or CI/CD systems hitting the API too frequently. Rate-limiting and backoff mechanisms should be configured.
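
On clusters with API Priority and Fairness enabled (the default in current releases), the flow-control metrics show which clients are being rejected:

kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total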

4. How do I detect and resolve resource contention?

Use kubectl top and Prometheus metrics to identify pods or nodes under pressure. Adjust quotas or node pools, or implement HPA/VPA.
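
With metrics-server installed, a quick ranking of the heaviest consumers:

kubectl top nodes
kubectl top pods -A --sort-by=memory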

5. Can taints and tolerations cause pods not to schedule?

Yes. If no pod tolerates a node's taint, it will remain unscheduled. Review node taints and pod tolerations during scheduling failures.
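
Listing every node's taints in one pass makes the mismatch obvious:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'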