Understanding Kubernetes Architecture

Control Plane vs Node-Level Responsibilities

Kubernetes separates orchestration into control plane components (API server, scheduler, controller manager, etcd) and node-level services (kubelet, kube-proxy, container runtime). Many issues arise from incorrect assumptions about where the boundary between these two layers lies and how they interact.
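
On clusters where the control plane itself runs as pods (kubeadm-style installs; managed services hide these components), a quick inventory of both layers is:

kubectl get pods -n kube-system -o wide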

Key Infrastructure Dependencies

  • Cloud provider APIs (for storage, networking, autoscaling)
  • Container runtimes (e.g., containerd, CRI-O)
  • CNI plugins for networking (Calico, Flannel, Cilium)

Common and Complex Troubleshooting Scenarios

1. Pods in Pending or CrashLoopBackOff State

These are surface symptoms. Root causes include insufficient node resources, unbound PVCs, image pull failures, and misconfigured readiness/liveness probes.

kubectl describe pod my-app-pod
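
If describe points to a crashing container rather than a scheduling problem, the previous container's logs usually hold the actual error (my-app-pod is a placeholder name):

kubectl logs my-app-pod --previous
kubectl get pod my-app-pod -o jsonpath='{.status.containerStatuses[*].state}'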

2. Persistent Volume Binding Failures

PVCs may remain unbound due to zone mismatches, unavailable storage classes, or misconfigured reclaim policies, a failure mode that is especially common in multi-zone clusters.
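
To see why a claim is stuck, start with kubectl describe pvc and kubectl get storageclass. In multi-zone clusters, a topology-aware StorageClass avoids most zone mismatches; the sketch below uses a hypothetical name and an example CSI provisioner, not values from this article:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ssd             # placeholder name
provisioner: ebs.csi.aws.com           # example driver; substitute your provider's CSI driver
volumeBindingMode: WaitForFirstConsumer # bind only once a pod is scheduled, in that pod's zone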

3. Node Not Ready / DiskPressure / MemoryPressure

Node conditions degrade when resource usage crosses the kubelet's eviction thresholds. The kubelet reports the condition and the control plane applies a matching taint (for example node.kubernetes.io/disk-pressure) so no further pods are scheduled there. Monitor kubectl get nodes and node events.
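
To confirm which condition tripped and when, inspect the node's Conditions block and recent node events (NODE_NAME is a placeholder):

kubectl describe node NODE_NAME | grep -A 8 Conditions
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp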

4. API Server Latency or Unavailability

High etcd latency, API throttling, or network partitioning can cause kubectl hangs and autoscaler failures. Use metrics like apiserver_request_duration_seconds to identify root causes.
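
The API server exposes these metrics directly; a rough first look (run from a workstation with sufficient RBAC permissions) is:

kubectl get --raw /metrics | grep apiserver_request_duration_seconds | head -n 20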

5. Container Runtime Failures

Misconfigurations or version mismatches between the kubelet and containerd may prevent pods from starting at all. Use journalctl -u containerd and crictl ps for deep debugging.
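
On the affected node (assuming containerd as the runtime and a systemd-managed kubelet), these commands surface most runtime-level failures; CONTAINER_ID is a placeholder:

journalctl -u containerd --since "15 min ago"
journalctl -u kubelet --since "15 min ago"
crictl ps -a
crictl logs CONTAINER_ID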

Diagnostic Techniques and Tools

Using Events and Describe

Always start with kubectl describe pod or kubectl get events --sort-by=.metadata.creationTimestamp to correlate scheduling or readiness failures.
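
Filtering to Warning events cuts the noise considerably:

kubectl get events --field-selector type=Warning --sort-by=.metadata.creationTimestamp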

Check Node and Daemon Health

Use:

kubectl get nodes -o wide
kubectl describe node NODE_NAME
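
On the node itself (assuming a systemd-managed kubelet), confirm the daemon is healthy:

systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager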

Audit API Server Logs and etcd

On control plane nodes, inspect:

journalctl -u kube-apiserver
journalctl -u etcd
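
On kubeadm-style clusters the API server and etcd run as static pods rather than systemd units, so their logs are read through kubectl instead (pod names follow the kubeadm convention of component-NODE_NAME):

kubectl logs -n kube-system kube-apiserver-NODE_NAME
kubectl logs -n kube-system etcd-NODE_NAME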

Network Policy and DNS Failures

Use throwaway test pods such as busybox with nslookup and wget (or a curl-equipped debug image) to trace inter-pod communication issues. NetworkPolicy misalignment is a common culprit in enterprise clusters.
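
A throwaway pod is usually enough to separate DNS problems from policy problems; the image tag and my-service.my-namespace are placeholders:

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- wget -qO- http://my-service.my-namespace.svc.cluster.local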

Custom Controller/Operator Issues

Check CRDs, controller logs, and watch loop stability. Improper reconciler loops may cause infinite retries or resource contention.
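
Typical starting points, with my-operator and its namespace as placeholder names:

kubectl get crds
kubectl logs -n my-operator-system deployment/my-operator --tail=200
kubectl get events -n my-operator-system --sort-by=.metadata.creationTimestamp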

Best Practices for Operational Resilience

  • Implement resource requests/limits to prevent node overcommit
  • Enable PodDisruptionBudgets to maintain availability during upgrades (a minimal sketch of both appears after this list)
  • Use taints/tolerations to isolate system workloads
  • Regularly audit cluster-wide RBAC policies for security and stability
  • Adopt observability stack: Prometheus, Grafana, Fluentd/FluentBit, Loki
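
A minimal sketch of the first two practices, using placeholder names, images, and values rather than anything prescribed by this article:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25        # example image
        resources:
          requests:              # what the scheduler reserves
            cpu: 100m
            memory: 128Mi
          limits:                # hard ceiling enforced at runtime
            cpu: 500m
            memory: 256Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: web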

Conclusion

Kubernetes abstracts away many complexities of deployment and scaling, but that abstraction can mask subtle failure modes. Engineers responsible for cluster stability must master both the declarative surface and the dynamic, often hidden, runtime behaviors. By combining kubectl diagnostics, log inspection, and a systems-level view of the control and data planes, teams can resolve even the most obscure Kubernetes issues before they impact production reliability.

FAQs

1. Why do pods get stuck in Terminating state?

Usually a finalizer whose controller never removes it, or a volume that fails to unmount. Use kubectl patch to remove finalizers in emergencies.
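
An emergency last resort, since it skips whatever cleanup the finalizer was guarding (my-app-pod is a placeholder):

kubectl patch pod my-app-pod -p '{"metadata":{"finalizers":null}}' --type=merge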

2. How can I troubleshoot DNS resolution issues inside pods?

Deploy a debug container (e.g., busybox), check /etc/resolv.conf, and use nslookup or dig to verify cluster DNS services.

3. What causes excessive API server throttling?

Overactive controllers or CI/CD systems hitting the API too frequently. Rate-limiting and backoff mechanisms should be configured.
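
On clusters with API Priority and Fairness enabled (the default in current releases), the flow-control metrics show which clients are being rejected:

kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total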

4. How do I detect and resolve resource contention?

Use kubectl top and Prometheus metrics to identify pods or nodes under pressure. Adjust quotas or node pools, or implement HPA/VPA.
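
With metrics-server installed, a quick ranking of the heaviest consumers:

kubectl top nodes
kubectl top pods -A --sort-by=memory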

5. Can taints and tolerations cause pods not to schedule?

Yes. If no pod tolerates a node's taint, it will remain unscheduled. Review node taints and pod tolerations during scheduling failures.
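
Listing every node's taints in one pass makes the mismatch obvious:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'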