Understanding Playbook Execution Hangs
Symptoms and Scenarios
- Playbooks freeze on a specific task without logging progress
- Output stalls after gathering facts
- Runs succeed sporadically or differ between environments
- SSH connections hang on unreachable hosts
Why This Matters in Enterprise Automation
Hanging playbooks break automation pipelines, delay deployments, and introduce uncertainty. In regulated or high-availability environments, these failures can cascade into SLA violations or downtime.
Root Causes and Architectural Implications
SSH Behavior and Connection Timeouts
By default, Ansible connects to managed hosts over SSH, so a hung task often points to a stalled SSH connection. Without connection timeouts or keepalive settings, a host that stops responding mid-session can block the run indefinitely.
Fact Gathering Bottlenecks
The default gather_facts: true setting makes Ansible run the setup module on each host at the start of a play. On unstable or resource-constrained nodes, this can hang or time out silently, especially if DNS resolution or Python interpreter discovery fails.
Deadlocks in Asynchronous Tasks
Improperly managed async tasks or background jobs (e.g., handlers waiting for services that never start) can deadlock the playbook without apparent errors.
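One way to keep a background job from blocking a play forever is to start it without waiting and then poll it a bounded number of times. The sketch below assumes a hypothetical script at /usr/local/bin/long_job.sh; the time limits are illustrative, not recommendations.

```yaml
- name: Start the long-running job without waiting for it
  ansible.builtin.command: /usr/local/bin/long_job.sh   # hypothetical script
  async: 600        # allow the job at most 600 seconds on the target
  poll: 0           # return immediately instead of blocking
  register: long_job

- name: Poll the job a bounded number of times
  ansible.builtin.async_status:
    jid: "{{ long_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 60       # give up after 60 checks
  delay: 10         # wait 10 seconds between checks
```

Because both the job and the polling loop have upper bounds, the play fails with a clear error instead of waiting silently.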
Diagnosing Hanging Playbooks
Step 1: Enable Verbose Output
ansible-playbook -vvv site.yml
This logs connection details, module invocations, and remote stdout/stderr. Add a fourth v (-vvvv) to include SSH connection debugging.
Step 2: Use Timeout Parameters
ansible_ssh_common_args: '-o ConnectTimeout=10'
Define in inventory or host_vars to apply connection-level timeouts to problematic nodes.
Step 3: Isolate the Task
Use --start-at-task and --limit to re-run specific hosts and tasks:
ansible-playbook playbook.yml --start-at-task="Restart nginx" --limit host1
Step 4: Check for Long-Running Commands
Package commands such as apt install or yum install, when run through the command or shell module, may wait on interactive prompts. Add -y or ensure unattended mode is configured.
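A minimal sketch of both approaches, assuming a Debian-family target and using nginx only as an example package name:

```yaml
- name: Install nginx through the package module
  ansible.builtin.apt:
    name: nginx
    state: present
    update_cache: true

- name: If a raw command is unavoidable, force non-interactive mode
  ansible.builtin.command: apt-get install -y nginx
  environment:
    DEBIAN_FRONTEND: noninteractive   # suppress debconf prompts
```

The module form is usually preferable because it is idempotent and needs no prompt handling; the command form is shown only for cases where a module is not an option.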
Step 5: Use strace or ps on Target Node
If a task hangs during module execution, SSH into the target node and trace the stuck process:
ps aux | grep ansible
strace -p PID
Common Pitfalls in Enterprise Deployments
Slow DNS Resolution
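When Ansible runs the setup module, collecting hostname and FQDN facts can trigger DNS lookups on the target. Misconfigured DNS can add 10-30 seconds per host. If SSH logins themselves are slow, check whether the target's sshd performs reverse lookups (the UseDNS option) and disable it if not required.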
SSH Key Forwarding or Agent Prompts
If a key's passphrase is not already loaded into an ssh-agent, SSH may prompt for it mid-run and block with no terminal to answer. Ensure keys are loaded and agents are persistent in CI/CD environments.
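One option is to tell OpenSSH never to prompt, so a missing key fails fast instead of hanging. The sketch below assumes key-based authentication and a group_vars/all.yml file; it is not suitable for hosts that rely on password login.

```yaml
# group_vars/all.yml (assumed location)
# BatchMode=yes disables passphrase and password prompts, so SSH
# exits with an error instead of waiting for input.
ansible_ssh_common_args: "-o BatchMode=yes -o ConnectTimeout=10"
```

Note that a host-level ansible_ssh_common_args value replaces this one rather than extending it, so repeat the option there if needed.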
Handlers Waiting on Broken Services
tasks:
  - name: Restart service
    service:
      name: myservice
      state: restarted
    notify: Wait for socket

# notify only triggers tasks defined under handlers
handlers:
  - name: Wait for socket
    wait_for:
      path: /var/run/myservice.sock
      timeout: 30
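If the path is missing or incorrect, wait_for blocks for the full timeout before failing. When timeout is omitted, the default is 300 seconds, which can look like a hang.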
Step-by-Step Fixes
Fix 1: Use Timeout-Aware Strategies
- name: Run command with timeout
  command: /usr/bin/do_something
  async: 60
  poll: 5
Here async: 60 caps the command's runtime at 60 seconds and poll: 5 checks its status every 5 seconds. The task fails if the command has not finished within the limit.
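As an alternative, ansible-core 2.10 and later accept a timeout keyword directly on a task. A minimal sketch, reusing the same command:

```yaml
- name: Run command with a hard time limit
  ansible.builtin.command: /usr/bin/do_something
  timeout: 60   # fail the task if it runs longer than 60 seconds
```

This keeps the task synchronous, which may be simpler when no background execution is needed.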
Fix 2: Harden Inventory SSH Settings
[web]
host1 ansible_host=10.0.0.1 ansible_ssh_common_args='-o ConnectTimeout=5 -o ServerAliveInterval=10'
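The same settings can be expressed in a YAML inventory and applied to the whole group. The sketch below also sets ServerAliveCountMax, an addition not in the original example, so that the dead-connection window is explicit.

```yaml
web:
  hosts:
    host1:
      ansible_host: 10.0.0.1
  vars:
    # Give up connecting after 5 seconds; once connected, send a
    # keepalive every 10 seconds and drop the session after 3
    # unanswered probes (roughly 30 seconds).
    ansible_ssh_common_args: >-
      -o ConnectTimeout=5
      -o ServerAliveInterval=10
      -o ServerAliveCountMax=3
```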
Fix 3: Disable Fact Gathering When Not Needed
- hosts: all
  gather_facts: false
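When a play needs only a few basic facts, a middle ground is to collect the minimal subset with a time limit. This is a sketch; the 10-second value is illustrative.

```yaml
- hosts: all
  gather_facts: true
  gather_subset:
    - "!all"          # skip the large subsets, keep only the minimal facts
  gather_timeout: 10  # per-collector time limit in seconds
  tasks:
    - name: Show the distribution from the minimal fact set
      ansible.builtin.debug:
        var: ansible_facts.distribution
```

Skipping the hardware and network subsets avoids some of the slowest collectors on constrained nodes.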
Fix 4: Introduce Execution Timeouts Globally
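[defaults]
timeout = 30
In ansible.cfg, timeout sets the connection timeout in seconds (the default is 10); it does not limit how long a task may run. For a global cap on task execution time, set task_timeout in the same section (available since ansible-core 2.10; the default of 0 means no limit).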
Fix 5: Monitor Execution with Callback Plugins
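Use callbacks_enabled = profile_tasks (callback_whitelist on Ansible 2.10 and earlier) in the [defaults] section of ansible.cfg to log timing per task and identify bottlenecks. On recent releases the plugin ships in the ansible.posix collection.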
Best Practices for Stable Automation
- Keep playbooks idempotent and time-bounded
- Always use async for slow or network-dependent tasks
- Use retries with until for flaky services
- Isolate environments using test inventories
- Pin Ansible and collection versions to avoid regressions
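The retry practice above can be sketched as follows, assuming a hypothetical health endpoint on the target:

```yaml
- name: Wait for the service health endpoint to respond
  ansible.builtin.uri:
    url: http://localhost:8080/health   # hypothetical endpoint
    status_code: 200
    timeout: 5        # per-request socket timeout in seconds
  register: health
  until: health.status == 200
  retries: 5          # stop retrying after 5 more attempts
  delay: 10           # wait 10 seconds between attempts
```

Each request and the loop as a whole are bounded, so a service that never comes up produces a failure rather than a stalled run.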
Conclusion
Hanging Ansible playbooks are not just execution annoyances—they indicate architectural friction between orchestration layers and infrastructure realities. By understanding the root causes and equipping playbooks with timeout, monitoring, and recovery logic, enterprises can ensure that automation remains a source of efficiency rather than frustration.
FAQs
1. Why does Ansible hang on gathering facts?
This often occurs due to DNS issues, unreachable hosts, or broken Python environments on the target. Disable fact gathering if not needed.
2. How do I debug a stuck Ansible playbook?
Use -vvv for verbose output, isolate the hanging task with --start-at-task, and check target system processes manually.
3. What's the best way to handle long-running tasks?
Use the async and poll pattern with timeouts to avoid indefinite hangs. Always wrap slow commands with checks.
4. Can SSH settings prevent playbook hangs?
Yes, using SSH options like ConnectTimeout and ServerAliveInterval ensures Ansible detects connection issues quickly.
5. Is it safe to disable fact gathering globally?
Yes, if your playbooks don't rely on facts. It speeds up execution and reduces failure points on non-standard environments.