Understanding the Restart Loop Problem
How Docker Handles Restarts
Docker uses restart policies such as --restart=always or --restart=on-failure to determine whether a container should be restarted after it exits. If a container exits unexpectedly due to runtime errors, misconfiguration, or dependency failures, it can be restarted indefinitely, leading to high CPU usage and log flooding.
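For example, the two policies behave quite differently (container names and the nginx image are placeholders; these assume a running Docker daemon):

```shell
# Restart forever, even across daemon restarts:
docker run -d --restart=always --name web-always nginx

# Restart only on non-zero exit, giving up after 5 attempts:
docker run -d --restart=on-failure:5 --name web-retry nginx
```

The retry cap on on-failure is what prevents an unbounded restart loop.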
Symptoms of the Problem
- Container exits every few seconds with exit code 1 or 137
- Docker logs show no meaningful output
- Systemd or Kubernetes reports CrashLoopBackOff
- Services dependent on the container time out
Root Causes of Infinite Container Restarts
1. Faulty Entry Point or CMD
Incorrect scripts, missing binaries, or misconfigured environment variables can cause the container to crash immediately at launch.
2. Resource Limits
Exceeding a memory limit causes the kernel to kill the container process (OOMKilled), which Docker reports as a failure. Note that CPU limits only throttle the process; they do not kill it.
3. Crash in Background Daemon
If the main process forks and exits, Docker assumes the container has finished, leading to unintended restarts even if child processes run fine.
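The effect can be reproduced in a few lines of shell: the backgrounded job survives, but the script Docker watches as PID 1 exits immediately (my_service below is a hypothetical binary):

```shell
#!/bin/sh
# Anti-pattern: the entrypoint backgrounds the real workload and exits.
# Docker watches only PID 1 (this script); once it ends, the container
# is "done" even though the child keeps running.
sleep 60 &            # stand-in for a daemonized service
bg_pid=$!
echo "entrypoint finished; Docker now considers the container exited"

# Fix: hand PID 1 to the service itself instead:
# exec my_service
```

Using exec makes the service itself PID 1, so the container lives exactly as long as the service does.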
4. Dependency Failures
Containers that rely on unavailable services (e.g., databases or queues) may exit on connection errors unless retries are handled properly in code.
Diagnostic Techniques
Step 1: Inspect Container Logs
docker logs <container_id>
If the logs are empty or too brief, use --log-driver options to route output to syslog or files for persistent debugging.
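For instance, the default json-file driver can be capped so a crash-looping container does not flood the disk (size and file-count values are illustrative):

```shell
docker run -d \
  --log-driver=json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  my-image
```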
Step 2: Analyze Exit Codes
docker inspect <container_id> --format='{{.State.ExitCode}}'
Common codes include 1 (general application error), 137 (SIGKILL, often the OOM killer; 128 + 9), and 139 (segmentation fault; 128 + 11). Any code above 128 means the process was killed by signal (code minus 128).
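Because codes above 128 encode 128 plus the signal number, a small helper can translate them (a sketch, not part of Docker itself):

```shell
# Translate a container exit code into a likely cause.
# Codes above 128 mean the process died from signal (code - 128).
explain_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -ne 0 ]; then
    echo "application error (exit $code)"
  else
    echo "clean exit"
  fi
}

explain_exit 137   # killed by signal 9 (SIGKILL, often the OOM killer)
explain_exit 139   # killed by signal 11 (SIGSEGV)
```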
Step 3: Disable Restart Temporarily
docker update --restart=no <container_id>
This allows debugging without the container restarting repeatedly.
Step 4: Override Entrypoint for Debug
docker run -it --entrypoint /bin/sh <image>
Use this to enter the container's filesystem and manually execute commands to isolate the failure.
Fix Strategies
1. Correct EntryPoint and CMD
Verify that the default script handles errors gracefully and stays alive if required. Use a long-running process (e.g., tail -f /dev/null) for testing.
2. Add Health Checks
Use HEALTHCHECK in Dockerfiles to give Docker better insight into actual container health instead of relying on process exits alone.
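A minimal sketch, assuming the service exposes an HTTP endpoint (the /health path and port 8080 are assumptions, and curl must exist in the image):

```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1
```

The start-period gives slow-starting services a grace window before failed checks count against the retry budget.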
3. Set Graceful Retry Logic
Ensure apps retry failed connections instead of crashing. Use exponential backoff strategies for startup routines.
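In shell-based entrypoints, the same idea looks roughly like this (check_dependency is a stand-in probe; in practice it might be something like nc -z db 5432):

```shell
#!/bin/sh
# Exponential backoff for a startup dependency check.
# Stand-in probe: pretend the dependency becomes reachable on attempt 3.
check_dependency() { [ "$attempt" -ge 3 ]; }

attempt=1
max_attempts=5
delay=1
until check_dependency; do
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "dependency unavailable after $attempt attempts" >&2
    exit 1    # exit non-zero so --restart=on-failure can take over
  fi
  echo "attempt $attempt failed; retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))      # back off: 1s, 2s, 4s, ...
  attempt=$((attempt + 1))
done
echo "dependency is up; starting service"
```

Crucially, the script still exits non-zero after the retry budget is exhausted, so the restart policy remains the backstop rather than the first line of defense.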
4. Increase Resource Limits
docker run --memory=1g --cpus=1.5 ...
Tune container resource limits and monitor them using docker stats or Prometheus exporters.
5. Validate Base Image Updates
Changes in base images may introduce missing dependencies. Rebuild with pinned versions and test in CI before rolling out.
Architectural Considerations
Container Lifecycle Control
Design services to fail fast but not fail fatally. Implement supervisors inside containers (e.g., s6-overlay, dumb-init) to manage subprocesses robustly.
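As a sketch of the dumb-init approach (the base image version and service path are assumptions):

```dockerfile
FROM alpine:3.20
RUN apk add --no-cache dumb-init
# dumb-init runs as PID 1, forwards signals, and reaps zombie children
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/usr/local/bin/my_service"]
```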
Kubernetes Implications
In Kubernetes, crash loops cause deployment rollbacks or delayed scaling. Use liveness and readiness probes to decouple container crashes from pod health when appropriate.
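For instance, a liveness probe with a modest failureThreshold keeps a slow-starting container from being killed prematurely, while a separate readiness probe controls traffic (endpoints, port, and timings are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
```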
Best Practices
- Log early in startup scripts to identify entry-point failures
- Pin all base images and dependencies for deterministic builds
- Use restart=on-failure with a retry limit instead of always in production
- Isolate external dependencies during CI/CD testing
- Build minimal containers to reduce surface area for faults
Conclusion
Infinite container restarts represent a complex intersection of bad configuration, faulty assumptions, and overlooked edge cases. By dissecting failure causes and introducing proper diagnostics, DevOps teams can stabilize deployments and reduce MTTR (Mean Time To Recovery). Prevention requires aligning Docker configurations with application behaviors, especially around lifecycle management and fault tolerance.
FAQs
1. What's the difference between Exit Code 1 and 137 in Docker?
Exit code 1 is a general application error, while 137 means the process was killed with SIGKILL (137 = 128 + 9), most often by the kernel OOM killer when the container exceeds its memory limit.
2. How can I stop a container from restarting automatically?
Use docker update --restart=no or set restart: "no" in the Docker Compose file. This halts restart loops during debugging.
3. Why does my container exit even though the process is running?
If the main PID exits (e.g., a parent script that launches a background job), Docker considers the container terminated. Use proper process supervision.
4. Is using "restart: always" a good idea?
Only for containers designed to self-heal. Otherwise, use "on-failure" with retry limits to avoid masking real issues.
5. Can Docker health checks prevent restart loops?
Health checks help, but they don't stop restarts. They allow orchestrators to make smarter scheduling and failover decisions.