Understanding Kubernetes Architecture
Core Components
Kubernetes is composed of the control plane (API Server, Controller Manager, Scheduler, etcd) and node agents (kubelet, kube-proxy, container runtime). Failures often arise when control plane communication is broken or node conditions change unexpectedly.
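Both kinds of failure usually show up in node status first. The sketch below lists nodes whose Ready condition is not True. It assumes kubectl is already configured for the target cluster, and the helper name not_ready_nodes is hypothetical.

```shell
# not_ready_nodes: print every node whose Ready condition is not "True".
# "Unknown" usually means the kubelet has stopped reporting to the API
# server; "False" means the kubelet itself reports the node as unhealthy.
# Hypothetical helper; assumes kubectl points at the target cluster.
not_ready_nodes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F'\t' '$2 != "True" { print $1, $2 }'
}
```

Running not_ready_nodes prints nothing when every node is Ready, which makes it easy to use in a quick health check.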
Workload Deployment Model
Kubernetes workloads are deployed via declarative manifests using Deployments, StatefulSets, DaemonSets, and Jobs. Issues often occur due to misconfigured manifests, incorrect resource requests, or broken container images.
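A minimal sketch of a declarative Deployment with explicit resource requests follows. The helper name deployment_manifest, the replica count, and the resource values are illustrative assumptions, not recommendations.

```shell
# deployment_manifest NAME IMAGE: print a minimal Deployment manifest with
# explicit resource requests. The selector and the pod template labels are
# generated from the same variable, so they cannot drift apart (the API
# server rejects a Deployment whose selector does not match its template).
# Hypothetical helper; replica count and resource values are placeholders.
deployment_manifest() {
  local name=$1 image=$2
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
  labels:
    app: ${name}
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      containers:
        - name: ${name}
          image: ${image}
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
EOF
}
```

Piping the output into kubectl apply --dry-run=server -f - validates the manifest against the API server without persisting it; dropping --dry-run=server applies it.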
Common Kubernetes Issues
1. Pods Stuck in Pending State
Occurs due to lack of available resources, node taints, or unsatisfied affinity/anti-affinity rules.
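The scheduler records why it could not place a pod in the pod's PodScheduled condition. A minimal sketch that surfaces that message for every Pending pod, assuming a configured kubectl (the helper name pending_pods is hypothetical):

```shell
# pending_pods: list Pending pods in all namespaces together with the
# scheduler's explanation from the PodScheduled condition message, e.g. a
# report of insufficient CPU or an untolerated taint. An empty message
# usually means the pod was scheduled and is still starting (for example,
# pulling its image). Hypothetical helper.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].message}{"\n"}{end}' |
    awk -F'\t' '{ msg = ($2 == "" ? "(no scheduler message)" : $2); print $1 ": " msg }'
}
```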
2. CrashLoopBackOff or ImagePullBackOff Errors
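CrashLoopBackOff means the container starts and then exits repeatedly, typically because of an application error, an incorrect command or entrypoint, or missing configuration (such as a ConfigMap or Secret) that the application depends on. ImagePullBackOff means the kubelet cannot pull the image: a wrong image name or tag, an inaccessible container registry, or missing pull credentials.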
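The waiting reason on each container distinguishes these cases. A minimal sketch that lists pods with a container stuck in one of them, assuming a configured kubectl; the helper name backoff_pods is hypothetical, and only regular containers (not init containers) are checked.

```shell
# backoff_pods: list pods with at least one container waiting in a failure
# state. CrashLoopBackOff points at the application or its command;
# ErrImagePull / ImagePullBackOff point at the image reference, registry,
# or pull credentials; CreateContainerConfigError usually means a
# referenced ConfigMap or Secret (or key) is missing.
# Hypothetical helper; initContainerStatuses are not inspected.
backoff_pods() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    awk -F'\t' '$2 ~ /CrashLoopBackOff|ErrImagePull|ImagePullBackOff|CreateContainerConfigError/ { print $1, $2 }'
}
```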
3. Persistent Volume Claims (PVC) Not Bound
Caused by mismatched StorageClass, unsupported access modes, or unprovisioned PersistentVolumes.
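To see which claims are affected and what they ask for, the following sketch lists every PVC that is not Bound along with its requested StorageClass and access modes. It assumes a configured kubectl; the helper name unbound_pvcs is hypothetical.

```shell
# unbound_pvcs: list PVCs whose phase is not Bound, with the StorageClass
# they request ("<none>" when storageClassName is empty) and their access
# modes, the two fields that most often prevent binding.
# Hypothetical helper; assumes kubectl points at the target cluster.
unbound_pvcs() {
  kubectl get pvc --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.storageClassName}{"\t"}{.spec.accessModes[*]}{"\n"}{end}' |
    awk -F'\t' '$2 != "Bound" { sc = ($3 == "" ? "<none>" : $3); print $1, $2, sc, $4 }'
}
```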
4. Ingress or Service Not Routing Traffic
Often due to misconfigured Ingress rules, missing annotations, incorrect service selectors, or CNI plugin/networking issues.
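A Service whose selector matches no pods has no endpoints, so traffic has nowhere to go. A minimal sketch that reads a Service's EndpointSlices (found through the kubernetes.io/service-name label), assuming a configured kubectl; the helper name service_endpoints is hypothetical.

```shell
# service_endpoints SERVICE [NAMESPACE]: print the addresses behind a
# Service from its EndpointSlices. No addresses usually means the Service
# selector matches no pods. Addresses are listed whether or not the
# endpoints are ready. Hypothetical helper.
service_endpoints() {
  local svc=$1 ns=${2:-default} ips
  ips=$(kubectl get endpointslices -n "$ns" -l "kubernetes.io/service-name=$svc" \
    -o jsonpath='{.items[*].endpoints[*].addresses[*]}')
  if [ -z "$ips" ]; then
    echo "service $ns/$svc has no endpoints; compare its selector with the pod labels" >&2
    return 1
  fi
  echo "service $ns/$svc endpoints: $ips"
}
```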
5. DNS Resolution Fails Inside Pods
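Results from CoreDNS misconfiguration, network policies that block traffic to the DNS pods, or kubelet DNS settings (such as --cluster-dns or --resolv-conf) that point pods at the wrong resolver.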
Diagnostics and Debugging Techniques
Use kubectl describe and logs
Retrieve detailed state of pods and workloads:
kubectl describe pod my-app-pod
kubectl logs my-app-pod -c app-container
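When a pod has several containers or has already restarted, it helps to capture everything at once. A minimal sketch, assuming a configured kubectl; the helper name collect_pod_diagnostics and the default output directory are hypothetical.

```shell
# collect_pod_diagnostics POD [NAMESPACE] [OUTDIR]: save describe output,
# current logs, and previous-instance logs for all containers of a pod,
# then print the output directory. Hypothetical helper; OUTDIR defaults
# to ./diag-POD.
collect_pod_diagnostics() {
  local pod=$1 ns=${2:-default} out=${3:-./diag-$1}
  mkdir -p "$out"
  kubectl describe pod "$pod" -n "$ns" > "$out/describe.txt" 2>&1
  kubectl logs "$pod" -n "$ns" --all-containers=true > "$out/logs.txt" 2>&1
  # --previous fails if no container has restarted yet; ignore that case.
  kubectl logs "$pod" -n "$ns" --all-containers=true --previous \
    > "$out/logs-previous.txt" 2>&1 || true
  echo "$out"
}
```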
Check Node Conditions and Scheduler Events
Identify resource pressure or taints preventing pod scheduling:
kubectl get nodes
kubectl describe node node-name
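kubectl describe node shows taints one node at a time. The following sketch prints the taints of every tainted node in key=value:effect form, assuming a configured kubectl; the helper name node_taints is hypothetical.

```shell
# node_taints: print each node that carries taints, as key=value:effect
# entries (the value part is empty for taints without a value). Pods
# without a matching toleration are not scheduled onto NoSchedule nodes.
# Hypothetical helper; assumes kubectl points at the target cluster.
node_taints() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}={.value}:{.effect} {end}{"\n"}{end}' |
    awk -F'\t' '$2 != "" { sub(/ $/, "", $2); print $1 ": " $2 }'
}
```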
Test Service and Network Reachability
Use a temporary pod and tools such as nslookup, wget, and ping to debug DNS and network reachability (the busybox image ships wget rather than curl):
kubectl run -it --rm debug --image=busybox -- sh
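For a non-interactive DNS check, the same idea can be wrapped in a one-shot pod. This is a sketch assuming a configured kubectl; the helper name dns_probe is hypothetical, and the busybox:1.28 tag is an assumption (any image that ships nslookup will do).

```shell
# dns_probe [NAME]: run a throwaway pod that resolves NAME through the
# cluster DNS; --rm deletes the pod when nslookup exits.
# Hypothetical helper; the image tag is an assumption.
dns_probe() {
  local name=${1:-kubernetes.default}
  kubectl run "dns-probe-$$" --rm -i --restart=Never --image=busybox:1.28 \
    -- nslookup "$name"
}
```

For example, dns_probe my-service.my-namespace checks that a Service name in another namespace resolves.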
Validate PVC and PV Relationships
Ensure the PVC requests an existing StorageClass and an access mode that the available PVs or the provisioner support:
kubectl get pvc
kubectl describe pvc my-pvc
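A PVC that names a StorageClass that does not exist stays Pending indefinitely. A minimal sketch that checks this relationship, assuming a configured kubectl; the helper name check_pvc_storageclass is hypothetical.

```shell
# check_pvc_storageclass PVC [NAMESPACE]: report the StorageClass a PVC
# requests and whether it exists. Returns 1 when the class is missing.
# Hypothetical helper; assumes kubectl points at the target cluster.
check_pvc_storageclass() {
  local pvc=$1 ns=${2:-default} sc
  sc=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    # No class: the claim can usually bind only to a PV without a class.
    echo "pvc $ns/$pvc sets no storageClassName; it needs a pre-created PV"
    return 0
  fi
  if kubectl get storageclass "$sc" >/dev/null 2>&1; then
    echo "pvc $ns/$pvc requests StorageClass $sc (exists)"
  else
    echo "pvc $ns/$pvc requests StorageClass $sc, which does not exist"
    return 1
  fi
}
```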
Inspect Events and Deployment Rollouts
Track rollouts and failed deployments:
kubectl rollout status deployment/my-app
kubectl get events --sort-by=.metadata.creationTimestamp
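The two commands combine naturally: wait for the rollout with a timeout, and only when it stalls print the namespace's recent Warning events. A minimal sketch, assuming a configured kubectl; the helper name rollout_or_events and the 120s default are hypothetical.

```shell
# rollout_or_events DEPLOYMENT [NAMESPACE] [TIMEOUT]: wait for a rollout;
# if it does not complete in time, print the namespace's Warning events
# (oldest first) to stderr and return 1. Hypothetical helper.
rollout_or_events() {
  local deploy=$1 ns=${2:-default} timeout=${3:-120s}
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    return 0
  fi
  echo "rollout of $ns/$deploy did not complete; recent warnings:" >&2
  kubectl get events -n "$ns" --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp >&2
  return 1
}
```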
Step-by-Step Resolution Guide
1. Fix Pending Pods
Check if nodes have sufficient resources and no taints prevent scheduling:
kubectl describe pod pod-name
kubectl get nodes -o wide
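The Events section of kubectl describe pod states why the scheduler rejected each node. Add tolerations for intentional taints, or lower the pod's resource requests (or add capacity) when nodes report insufficient CPU or memory.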
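Tolerations can be added without editing the manifest by patching the Deployment. This is a sketch assuming a configured kubectl; the helper name add_toleration and the dedicated=gpu taint in the test are hypothetical.

```shell
# add_toleration DEPLOYMENT KEY VALUE EFFECT [NAMESPACE]: give a
# Deployment's pods a toleration for one taint via a JSON merge patch.
# Caution: a merge patch replaces the whole tolerations list, so include
# any tolerations the pods already rely on. Hypothetical helper.
add_toleration() {
  local deploy=$1 key=$2 value=$3 effect=$4 ns=${5:-default}
  kubectl patch deployment "$deploy" -n "$ns" --type=merge -p \
    "{\"spec\":{\"template\":{\"spec\":{\"tolerations\":[{\"key\":\"$key\",\"operator\":\"Equal\",\"value\":\"$value\",\"effect\":\"$effect\"}]}}}}"
}
```

For the resource side, kubectl set resources deployment my-app --requests=cpu=100m,memory=128Mi adjusts requests in a similar one-line way; either change triggers a new rollout.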
2. Resolve CrashLoopBackOff Errors
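Inspect container logs, including those of the previous (crashed) instance, and check the startup command. Confirm that environment variables are set and that Secret and ConfigMap volumes are mounted:
kubectl logs pod-name -c container-name
kubectl logs pod-name -c container-name --previous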
kubectl describe pod pod-name
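How the previous instance ended is often the fastest clue. A minimal sketch that prints each container's restart count and last termination reason and exit code, assuming a configured kubectl; the helper name last_termination is hypothetical.

```shell
# last_termination POD [NAMESPACE]: for each container, print the restart
# count and how its previous instance ended ("-" when it has not
# terminated). Reason OOMKilled (exit code 137) means the container
# exceeded its memory limit. Hypothetical helper.
last_termination() {
  local pod=$1 ns=${2:-default}
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}' |
    awk -F'\t' '{ printf "%s restarts=%s last=%s exit=%s\n", $1, $2, ($3 == "" ? "-" : $3), ($4 == "" ? "-" : $4) }'
}
```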
3. Bind PVCs Correctly
Ensure the StorageClass exists and that its provisioner supports the requested access mode:
kubectl get sc
kubectl describe pv pv-name
Adjust the PVC, or create a matching PV manually when no dynamic provisioner is available.
4. Repair Ingress and Service Issues
Check that an Ingress controller is installed and running, then validate the Ingress annotations and paths:
kubectl get ingress
kubectl describe ingress ingress-name
Verify that each backend Service exists, that its selector matches the pod labels, and that it has endpoints.
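A quick way to catch a typo in a backend name is to check every Service an Ingress references. A minimal sketch for networking.k8s.io/v1 Ingresses, assuming a configured kubectl; the helper name ingress_backends is hypothetical, and a defaultBackend is not checked.

```shell
# ingress_backends INGRESS [NAMESPACE]: check that every Service named in
# the Ingress rules (backend.service.name) exists in the namespace.
# Returns 1 if any is missing. Hypothetical helper; defaultBackend is
# not inspected.
ingress_backends() {
  local ing=$1 ns=${2:-default} svc missing=0
  for svc in $(kubectl get ingress "$ing" -n "$ns" \
      -o jsonpath='{.spec.rules[*].http.paths[*].backend.service.name}' |
      tr ' ' '\n' | sort -u); do
    if kubectl get service "$svc" -n "$ns" >/dev/null 2>&1; then
      echo "ok: $svc"
    else
      echo "missing: $svc"
      missing=1
    fi
  done
  return "$missing"
}
```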
5. Restore DNS Resolution
Check CoreDNS logs and ConfigMap:
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl edit configmap coredns -n kube-system
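Ensure pods can reach the cluster DNS Service IP (10.96.0.10 by default on kubeadm clusters; the nameserver line in a pod's /etc/resolv.conf shows the address actually in use).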
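The sketch below compares a pod's configured nameserver with the ClusterIP of the kube-dns Service, the Service name CoreDNS typically keeps for compatibility (an assumption here). It assumes a configured kubectl and an image that provides cat; the helper name check_pod_dns is hypothetical. A mismatch is expected for pods with dnsPolicy: Default or on clusters running NodeLocal DNSCache, which use a node-local address.

```shell
# check_pod_dns POD [NAMESPACE]: compare the first nameserver in a pod's
# /etc/resolv.conf with the kube-dns Service ClusterIP. Returns 1 on a
# mismatch. Hypothetical helper; the kube-dns Service name is assumed.
check_pod_dns() {
  local pod=$1 ns=${2:-default} dns_ip pod_ns
  dns_ip=$(kubectl get service kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
  pod_ns=$(kubectl exec "$pod" -n "$ns" -- cat /etc/resolv.conf |
    awk '$1 == "nameserver" { print $2; exit }')
  if [ "$pod_ns" = "$dns_ip" ]; then
    echo "ok: $pod uses cluster DNS $dns_ip"
  else
    echo "mismatch: $pod uses ${pod_ns:-<none>}, cluster DNS is $dns_ip"
    return 1
  fi
}
```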
Best Practices for Kubernetes Operations
- Use readiness probes so traffic reaches only pods that are ready, and liveness probes so hung containers are restarted.
- Use kubectl diff and kubectl apply --server-side to manage declarative resources safely.
- Label resources consistently to improve targeting and maintenance.
- Set resource requests so the scheduler can place pods accurately, and limits to prevent noisy-neighbor issues.
- Monitor with Prometheus and Grafana, and alert on common symptoms such as frequent pod restarts or sustained high CPU usage.
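The first practice can be scripted. kubectl diff exits 0 when there are no differences, 1 when differences exist, and greater than 1 on an error, so a wrapper can branch on that code. A minimal sketch, assuming a configured kubectl; the helper name safe_apply is hypothetical, and the server-side dry run before the real apply is a design choice, not a requirement.

```shell
# safe_apply FILE: print the diff, then (only if something changed) run a
# server-side dry run followed by a real server-side apply.
# Hypothetical helper; returns kubectl diff's code when diff itself fails.
safe_apply() {
  local file=$1 rc=0
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no changes for $file" ;;
    1) kubectl apply --server-side --dry-run=server -f "$file" &&
         kubectl apply --server-side -f "$file" ;;
    *) echo "kubectl diff failed with exit code $rc" >&2
       return "$rc" ;;
  esac
}
```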
Conclusion
Kubernetes enables powerful orchestration at scale but demands disciplined operations and in-depth understanding of its primitives. Most production issues stem from misconfigured workloads, misaligned resources, or overlooked environment-specific constraints. With effective use of kubectl, logging, and resource validation, DevOps teams can resolve common pitfalls and ensure resilient Kubernetes clusters.
FAQs
1. Why is my pod stuck in Pending state?
Check for insufficient resources or node taints. Use kubectl describe pod to identify scheduling constraints.
2. What causes CrashLoopBackOff in Kubernetes?
Usually an application failure on startup, invalid commands, or missing dependencies. Review logs and init containers.
3. Why isn't my PVC binding to a PV?
The PVC may request an unsupported access mode or nonexistent StorageClass. Ensure compatibility between PVC and PV.
4. My Ingress is not routing traffic—why?
An Ingress controller may not be deployed, or the rules or annotations may be misconfigured. Also verify DNS and Service endpoints.
5. How do I fix DNS resolution issues in pods?
Inspect CoreDNS logs, verify pod resolv.conf contents, and ensure network policies or iptables rules aren't blocking DNS access.