Understanding the Restart Loop Problem

How Docker Handles Restarts

Docker uses restart policies such as --restart=always or --restart=on-failure to decide whether a container should be restarted after it exits. If a container exits unexpectedly because of runtime errors, misconfiguration, or dependency failures, a policy of always (or on-failure without a retry cap) will restart it indefinitely, leading to high CPU usage and log flooding.
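
For example, a capped policy can be set at run time or applied to an existing container with docker update (the container name my-app and the retry count of 5 are illustrative):

docker run -d --name my-app --restart=on-failure:5 my-image
docker update --restart=on-failure:5 my-app

With on-failure:5, Docker gives up after five failed restart attempts instead of looping forever.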

Symptoms of the Problem

  • Container exits every few seconds with exit code 1 or 137
  • Docker logs show no meaningful output
  • Kubernetes reports CrashLoopBackOff, or systemd-managed containers flap between failed and restarting states
  • Services dependent on the container time out

Root Causes of Infinite Container Restarts

1. Faulty Entry Point or CMD

Incorrect scripts, missing binaries, or misconfigured environment variables can cause the container to crash immediately at launch.

2. Resource Limits

Exceeding a memory limit causes the kernel's OOM killer to terminate the container process (OOMKilled, exit code 137), which Docker treats as a failure. CPU limits do not kill the process, but heavy throttling can slow startup enough that health checks or dependent services time out.

3. Crash in Background Daemon

If the main process forks and exits, Docker assumes the container has finished, leading to unintended restarts even if child processes run fine.
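
A minimal entrypoint sketch illustrates the fix; the binary path /usr/local/bin/my-daemon is hypothetical. Using exec keeps the service itself as PID 1, so the container lives exactly as long as the service does:

#!/bin/sh
# Problematic: backgrounding the daemon lets this script (PID 1) exit immediately
#   /usr/local/bin/my-daemon &
# Better: exec replaces the shell, making the daemon itself PID 1
exec /usr/local/bin/my-daemon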

4. Dependency Failures

Containers that rely on unavailable services (e.g., databases or queues) may exit on connection errors unless retries are handled properly in code.

Diagnostic Techniques

Step 1: Inspect Container Logs

docker logs <container_id>

If the logs are empty or too brief, use --log-driver options to route output to syslog or files for persistent debugging.
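
Note that docker logs only reads the json-file and local drivers. A hedged example that keeps rotated JSON logs on disk (the size and file counts are illustrative):

docker run -d --log-driver=json-file --log-opt max-size=10m --log-opt max-file=3 <image>

If you route output elsewhere instead (e.g., --log-driver=syslog), inspect it through that backend rather than docker logs.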

Step 2: Analyze Exit Codes

docker inspect <container_id> --format='{{.State.ExitCode}}'

Common codes include 1 (general application error), 137 (SIGKILL, most often the OOM killer), and 139 (SIGSEGV, a segmentation fault).
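
It also helps to check the OOMKilled flag and any runtime error message in the same pass:

docker inspect <container_id> --format='OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} Error={{.State.Error}}'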

Step 3: Disable Restart Temporarily

docker update --restart=no <container_id>

This allows debugging without the container restarting repeatedly.

Step 4: Override Entrypoint for Debug

docker run -it --entrypoint /bin/sh <image>

Use this to enter the container's filesystem and manually execute commands to isolate the failure.
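
From that shell, a few quick checks usually narrow things down (the binary path and variable prefix below are illustrative):

ls -l /usr/local/bin/app        # does the expected binary exist and is it executable?
env | grep -i APP_              # are the expected environment variables set?
/usr/local/bin/app              # run the real command by hand and watch the error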

Fix Strategies

1. Correct EntryPoint and CMD

Verify that the default script handles errors gracefully and stays alive if required. Use long-running processes (e.g., tail -f /dev/null) for testing.
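
To keep a suspect image alive for inspection, one option is to override its command with a no-op long-running process. The first form overrides CMD; the second is needed when the image defines an ENTRYPOINT:

docker run -d <image> tail -f /dev/null
docker run -d --entrypoint tail <image> -f /dev/null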

2. Add Health Checks

Use HEALTHCHECK in Dockerfiles to give Docker better insight into actual container health than the exit status of the main process alone.
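
A hedged Dockerfile sketch, assuming the service exposes an HTTP /health endpoint on port 8080 and that curl is installed in the image:

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Docker then reports the container as starting, healthy, or unhealthy in docker ps, and orchestrators can act on that status.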

3. Set Graceful Retry Logic

Ensure apps retry failed connections instead of crashing. Use exponential backoff strategies for startup routines.
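
A minimal startup-wrapper sketch, assuming a POSIX shell, netcat in the image, a database reachable at host db on port 5432, and an application binary at /usr/local/bin/app (all illustrative):

#!/bin/sh
# Wait for the database with exponential backoff before starting the app.
attempt=0
delay=1
until nc -z db 5432; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 6 ]; then
    echo "database still unreachable after $attempt attempts" >&2
    exit 1   # let a capped restart policy take over from here
  fi
  sleep "$delay"
  delay=$((delay * 2))   # 1s, 2s, 4s, 8s, 16s
done
exec /usr/local/bin/app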

4. Increase Resource Limits

docker run --memory=1g --cpus=1.5 ...

Tune container resource limits and monitor them using docker stats or Prometheus exporters.
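
A one-off snapshot of current usage against those limits; compare the MEM USAGE / LIMIT column with the configured --memory value:

docker stats --no-stream <container_id>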

5. Validate Base Image Updates

Changes in base images may introduce missing dependencies. Rebuild with pinned versions and test in CI before rolling out.
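
A hedged Dockerfile sketch of pinning (the image and tag are illustrative; substitute the exact version you have tested):

# Avoid floating tags such as :latest
FROM python:3.12-slim
# Stricter still: pin the digest recorded from a known-good build
# FROM python:3.12-slim@sha256:<digest>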

Architectural Considerations

Container Lifecycle Control

Design services to fail fast but not fatally. Run a lightweight init or supervisor inside the container (e.g., dumb-init as PID 1 to forward signals and reap zombies, or s6-overlay when a container must manage several subprocesses).
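
As a sketch, a Debian-based image can adopt dumb-init as PID 1 with a couple of Dockerfile lines (the base image and application path are illustrative):

FROM debian:12-slim
RUN apt-get update && apt-get install -y --no-install-recommends dumb-init \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/usr/local/bin/app"]

dumb-init forwards signals to the child process and reaps zombies, so the application shuts down cleanly instead of being force-killed on docker stop.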

Kubernetes Implications

In Kubernetes, crash loops stall rollouts and delay scaling. Use liveness and readiness probes so the kubelet can distinguish a container that is still warming up from one that has genuinely failed, rather than relying on process exits alone.

Best Practices

  • Log early in startup scripts to identify entry-point failures
  • Pin all base images and dependencies for deterministic builds
  • Use restart=on-failure with a retry cap (e.g., on-failure:5) instead of always in production
  • Isolate external dependencies during CI/CD testing
  • Build minimal containers to reduce surface area for faults

Conclusion

Infinite container restarts represent a complex intersection of bad configuration, faulty assumptions, and overlooked edge cases. By dissecting failure causes and introducing proper diagnostics, DevOps teams can stabilize deployments and reduce MTTR (Mean Time To Recovery). Prevention requires aligning Docker configurations with application behaviors, especially around lifecycle management and fault tolerance.

FAQs

1. What's the difference between Exit Code 1 and 137 in Docker?

Exit code 1 is a general application error, while 137 means the process was killed with SIGKILL, most often by the kernel's OOM killer when a memory limit is exceeded.

2. How can I stop a container from restarting automatically?

Use docker update --restart=no or modify the Docker Compose file to set restart: "no". This halts restart loops during debugging.

3. Why does my container exit even though the process is running?

If the main PID exits (e.g., a parent script that launches a background job), Docker considers the container terminated. Use proper process supervision.

4. Is using "restart: always" a good idea?

Only for containers designed to self-heal. Otherwise, use "on-failure" with retry limits to avoid masking real issues.

5. Can Docker health checks prevent restart loops?

Health checks help, but they don't stop restarts. They allow orchestrators to make smarter scheduling and failover decisions.