Background and Architectural Context
How Ansible Gathers Facts
By default, Ansible gathers facts at the start of each play for every host. This data is stored in the variable `ansible_facts` and used in conditionals, role logic, and templating. In large environments, or when dealing with unstable hosts, this process can time out or produce inconsistent data.
Inventory Complexity in Enterprises
Enterprise inventories are often dynamic, built from external sources such as AWS EC2, VMware, or CMDB APIs. Latency, API rate limits, or incorrect plugin configuration can produce partial inventories or inconsistent host targeting, especially when caching is disabled or misconfigured.
Common Symptoms
- Tasks fail intermittently without consistent host patterns
- Facts appear missing or outdated in templates and tasks
- Playbooks skip expected hosts without error
- Inventory resolution errors during ad hoc runs
Root Causes
1. Unreliable Fact Gathering
Fact gathering may fail because of SSH latency, restrictive `sudo` configurations, or long-running custom fact scripts on the host. The result is partial or entirely missing fact data, which causes templating failures or conditional logic that is skipped without warning.
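One way to keep slow collectors from stalling a play is to gather facts explicitly with a reduced subset and a time limit. The following is a minimal sketch, not a definitive setup: the `webservers` group and the 10-second limit are assumptions chosen for illustration.

```yaml
# Hypothetical play: the "webservers" group and the timeout value are assumptions.
- name: Gather a bounded set of facts
  hosts: webservers
  gather_facts: false            # skip the implicit gathering step
  tasks:
    - name: Collect only the minimal fact subset, with a per-collector time limit
      ansible.builtin.setup:
        gather_subset:
          - "!all"               # drop the heavy collectors
          - min                  # keep the minimal set (OS, distribution, and similar)
        gather_timeout: 10       # seconds allowed for each individual collector

    - name: Confirm that a basic fact was collected
      ansible.builtin.debug:
        msg: "OS family: {{ ansible_facts.os_family | default('not collected') }}"
```

Running the `setup` module as an ordinary task also means a failure appears as a normal task failure in the play output, which can make it easier to spot.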
2. Inventory Plugin Misconfiguration
Dynamic inventory plugins (e.g., `aws_ec2`, `vmware_vm_inventory`) can silently fail if credentials, regions, or filters are misconfigured. This results in empty or incomplete host lists.
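As an illustration, an `aws_ec2` inventory file can be written so that problems surface as errors instead of an empty host list. This is a sketch under stated assumptions: the region, the `Environment` and `Role` tag names, and their values are hypothetical and must match your own account.

```yaml
# aws_ec2.yml (the plugin expects the file name to end in aws_ec2.yml or aws_ec2.yaml)
# Region, tag names, and tag values below are illustrative assumptions.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running   # ignore stopped or terminated instances
  tag:Environment: production    # a typo here silently yields zero hosts
strict: true                     # make invalid compose/group expressions fatal
keyed_groups:
  - key: tags.Role               # with strict, an instance missing this tag raises an error
    prefix: role
```

With `strict: true`, an instance without the expected tag fails inventory parsing instead of being skipped. Whether that trade-off is acceptable depends on how consistently instances are tagged.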
3. Disabled or Expired Inventory Caches
For performance, inventory plugins support caching, but cached entries expire once the configured timeout passes, and cache files can become corrupted. Without a valid cache, Ansible falls back to live queries, which may return inconsistent results or fail outright during high-volume runs.
4. Asynchronous Task Dependencies on Facts
When tasks run with `async` (particularly with `poll: 0`) but rely on host facts, race conditions can arise: the facts may no longer be valid by the time the background job acts on them.
Diagnostics and Debugging Steps
Enable Verbose Logging
Run Ansible with `-vv` or `-vvv` to inspect failed task output, especially around fact gathering and inventory loading:
ansible-playbook site.yml -vvv
Check Facts Collection Manually
Run the `setup` module on problematic hosts to validate fact availability:
ansible all -m setup -i inventory.yml
Inspect Dynamic Inventory Source
Test dynamic inventory scripts or plugins manually:
ansible-inventory -i aws_ec2.yml --list
Step-by-Step Fix
1. Disable Unnecessary Fact Gathering
Use `gather_facts: false` in plays where facts are not needed. For known stable values, set facts explicitly with `set_fact`.
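A minimal sketch of this pattern follows. The `appservers` group, the `app_config_dir` variable, and the `/etc/myapp` path are hypothetical names used only for illustration.

```yaml
# Hypothetical names: "appservers", "app_config_dir", and /etc/myapp are assumptions.
- name: Run tasks that do not need gathered facts
  hosts: appservers
  gather_facts: false              # no setup module run at the start of the play
  tasks:
    - name: Set a known stable value explicitly
      ansible.builtin.set_fact:
        app_config_dir: /etc/myapp

    - name: Use the explicit value instead of a gathered fact
      ansible.builtin.debug:
        msg: "Config directory is {{ app_config_dir }}"
```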
2. Fix or Harden Inventory Plugins
Ensure dynamic inventories are configured with all required environment variables, IAM roles, or authentication tokens.
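A misconfigured plugin often returns an empty group, not an error, so it can help to fail fast before the real plays start. The sketch below assumes a group named `webservers` and a minimum of one host; both are placeholders.

```yaml
# Hypothetical guard play: "webservers" and the minimum count are assumptions.
- name: Fail fast when the dynamic inventory looks incomplete
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    expected_group: webservers
    minimum_hosts: 1
  tasks:
    - name: Check that the expected group exists and is populated
      ansible.builtin.assert:
        that:
          - expected_group in groups
          - groups[expected_group] | length >= minimum_hosts
        fail_msg: "Group '{{ expected_group }}' is missing or has too few hosts"
```

Placing this play first in `site.yml` stops the run before later plays are quietly skipped for lack of matching hosts.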
3. Enable and Manage Inventory Caching
Use persistent fact and inventory caching in `ansible.cfg`:
[defaults]
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts

[inventory]
cache = True
cache_plugin = jsonfile
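Caching can also be configured per inventory source, inside the plugin's own YAML file. The sketch below uses the standard inventory cache options; the region, timeout, cache path, and prefix are assumptions to adjust for your environment.

```yaml
# aws_ec2.yml with per-source caching.
# Region, timeout, path, and prefix are illustrative assumptions.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
cache: true                                      # enable caching for this source
cache_plugin: jsonfile                           # store cached results as JSON files
cache_timeout: 3600                              # seconds before cached data is stale
cache_connection: /tmp/ansible_inventory_cache   # directory for the cache files
cache_prefix: aws_ec2                            # prefix that identifies this source's files
```

Keeping the settings next to the source makes it possible to give a slow or rate-limited API a longer timeout than other sources.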
4. Guard Against Missing Facts
Use the Jinja2 `default` filter, an `is defined` test, or an `in` membership check to protect against missing fact-based variables:
- debug: msg="OS is {{ ansible_facts.os_family | default('unknown') }}"
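The same defensive idea applies to conditionals and hard requirements. The following sketch assumes a `chrony` package and the `default_ipv4` network fact purely as examples.

```yaml
# Illustrative guards: the package name and the required fact are assumptions.
- name: Guard tasks against missing facts
  hosts: all
  become: true
  tasks:
    - name: Stop early if a required fact is absent
      ansible.builtin.assert:
        that:
          - ansible_facts.default_ipv4 is defined
          - ansible_facts.default_ipv4.address is defined
        fail_msg: "Network facts were not collected for {{ inventory_hostname }}"

    - name: Install chrony only where the OS family is known to be RedHat
      ansible.builtin.package:
        name: chrony
        state: present
      # A missing fact evaluates to '' here, so the task is skipped instead of failing.
      when: (ansible_facts.os_family | default('')) == 'RedHat'
```

An `assert` suits facts the play cannot work without, while `default` suits facts where a fallback value is acceptable.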
5. Serialize Tasks That Depend on Facts
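For tasks that combine `async` with fact-dependent logic, limit concurrency with the play-level `serial` keyword, or avoid backgrounding (`poll: 0`) altogether, so that the facts are still valid when the task acts on them.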
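When a background job cannot be avoided, one option is to copy the needed fact into a variable before launching the job and then wait for completion explicitly. In this sketch the `appservers` group, the batch size, and the `/usr/local/bin/long_job.sh` script with its `--os-family` flag are all hypothetical.

```yaml
# Hypothetical job: the script path, its flag, and the group name are assumptions.
- name: Run a long job with a fixed snapshot of the fact it needs
  hosts: appservers
  serial: 5                        # process hosts in batches of five
  tasks:
    - name: Snapshot the fact before the job starts
      ansible.builtin.set_fact:
        job_os_family: "{{ ansible_facts.os_family | default('unknown') }}"

    - name: Start the long-running job in the background
      ansible.builtin.command:
        argv:
          - /usr/local/bin/long_job.sh
          - --os-family
          - "{{ job_os_family }}"
      async: 600                   # allow up to ten minutes
      poll: 0                      # do not wait here
      register: long_job

    - name: Wait for the job to finish before moving on
      ansible.builtin.async_status:
        jid: "{{ long_job.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 60
      delay: 10
```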
Best Practices for Stable Ansible Automation
- Pin plugin versions and validate against Ansible core releases
- Document and audit dynamic inventory filters and credentials
- Store essential facts in `host_vars` where appropriate
- Use retries and handlers for flaky tasks
- Set a consistent `gather_subset` to limit heavy fact collection
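Two of the practices above, a limited `gather_subset` and retries combined with a handler, might look like the following sketch. The URL, file paths, and retry counts are placeholders, not recommendations.

```yaml
# Placeholder values: the URL, paths, and retry settings are assumptions.
- name: Apply conservative defaults for fact gathering and flaky tasks
  hosts: all
  gather_subset:
    - "!all"
    - network                      # collect network facts plus the minimal set
  tasks:
    - name: Retry a download that sometimes fails
      ansible.builtin.get_url:
        url: https://example.com/artifact.tar.gz
        dest: /tmp/artifact.tar.gz
      register: download
      until: download is succeeded
      retries: 3
      delay: 5
      notify: Unpack artifact

  handlers:
    - name: Unpack artifact
      ansible.builtin.unarchive:
        src: /tmp/artifact.tar.gz
        dest: /opt/app
        remote_src: true           # the archive is already on the managed host
```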
Conclusion
Ansible's flexibility and power make it indispensable in enterprise automation, but intermittent task failures can undermine its reliability. Root causes often lie in inconsistent fact collection, dynamic inventory quirks, or improperly handled async execution. By tightening configuration, disabling unnecessary behaviors, and applying defensive templating patterns, teams can eliminate instability and scale Ansible usage with confidence.
FAQs
1. Should I disable fact gathering globally?
Only if facts are unused or replaced by static vars. Otherwise, disable selectively at the play level to optimize performance.
2. How do I detect expired inventory cache?
Compare file modification times in the cache directory with the configured timeout, or run in verbose mode and watch for the plugin querying its source again. To rule out stale data entirely, rerun with `--flush-cache`.
3. Why do some hosts not appear during dynamic inventory runs?
Check plugin filters, region settings, or auth. Missing tags or IAM permission issues can exclude hosts silently.
4. What causes facts to disappear between tasks?
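With `delegate_to`, gathered facts are assigned to the original inventory host, not the delegated one, unless `delegate_facts: true` is set, so they can appear to be missing. Values produced inside `async` jobs are likewise unavailable until the job has been collected. Use `set_fact` to store values that later tasks need.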
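A minimal sketch of delegated fact gathering, assuming two hypothetical groups named `appservers` and `dbservers`:

```yaml
# Hypothetical groups: "appservers" and "dbservers" are assumptions.
- name: Gather database host facts from a play that targets app servers
  hosts: appservers
  gather_facts: false
  tasks:
    - name: Collect facts about each database host and store them on that host
      ansible.builtin.setup:
      delegate_to: "{{ item }}"
      delegate_facts: true         # assign facts to the delegated host
      loop: "{{ groups['dbservers'] }}"

    - name: Read a delegated fact through hostvars
      ansible.builtin.debug:
        msg: "{{ item }} runs {{ hostvars[item].ansible_facts.distribution | default('unknown') }}"
      loop: "{{ groups['dbservers'] }}"
      run_once: true
```

Without `delegate_facts: true`, the gathered facts would be attached to the app server running the task, which is a common source of "missing" facts.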
5. Can I cache custom facts?
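Yes. Custom facts in `/etc/ansible/facts.d/` are collected under `ansible_local` and are cached along with other gathered facts by persistent backends such as jsonfile or redis. Registered variables are not cached; to persist a derived value, store it with `set_fact` and `cacheable: true`.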
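As a sketch, a derived value can be written to the fact cache as follows. It assumes a hypothetical JSON custom fact file `/etc/ansible/facts.d/app.fact` containing a `release` key, and that a persistent fact cache is already enabled in `ansible.cfg`.

```yaml
# Assumptions: a custom fact file app.fact with a "release" key exists on the host,
# and fact caching is configured; "app_release" is a hypothetical variable name.
- name: Persist a derived value in the fact cache
  hosts: all
  tasks:
    - name: Copy the custom fact into a cacheable variable
      ansible.builtin.set_fact:
        app_release: "{{ ansible_local.app.release | default('unknown') }}"
        cacheable: true            # also write the value to the fact cache

    - name: Show the stored value
      ansible.builtin.debug:
        msg: "Release recorded for {{ inventory_hostname }}: {{ app_release }}"
```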