Understanding Zabbix Polling Mechanisms
Types of Items and Polling Models
Zabbix supports two main models for agent item checks: passive (Zabbix agent) and active (Zabbix agent (active)), alongside SNMP, HTTP, and external-script checks. Passive items are polled by the Zabbix server or a proxy, while active items are collected by the agent itself and pushed to the server. An unbalanced mix of these can overload the poller processes.
Example passive check:
Type: Zabbix agent
Key: system.cpu.load[percpu,avg1]
Example active check:
Type: Zabbix agent (active)
Key: vfs.fs.size[/,free]
Polling Delay Symptoms
- Graphs show flat lines or missing data
- Triggers fire erroneously due to stale values
- Zabbix front-end shows high queue values
Architectural Root Causes
1. Poller/Unreachable Poller Bottlenecks
If the number of poller processes is too low, items queue up. This is especially true for items with short update intervals (5s, 10s) or long response times.
2. Overloaded Proxies
Zabbix proxies may become bottlenecks when monitoring many remote hosts. Disk I/O, slow MySQL/SQLite responses, or misconfigured cache sizes lead to delays in item processing and sending data upstream.
3. Network and SNMP Timeouts
High-latency SNMP devices or misconfigured retries/timeouts can cause pollers to hang, delaying the next items in the polling queue.
4. Custom Scripts or External Checks
Slow or inefficient external scripts can occupy poller threads. If scripts have no timeout logic or error handling, they block pollers indefinitely.
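The blocking problem above is easy to reproduce, and to guard against, with a small shell sketch (the 2-second limit and the simulated slow command are illustrative, not Zabbix defaults):

```shell
# Simulate an external check that hangs (sleep 5 stands in for a slow
# script) and bound it with a hard 2-second timeout.
value=$(timeout 2 sh -c 'sleep 5; echo 42')
status=$?

if [ "$status" -eq 124 ]; then
    # timeout(1) exits with 124 when it kills the command; report
    # ZBX_NOTSUPPORTED so Zabbix marks the item unsupported instead
    # of tying up a poller indefinitely.
    echo "ZBX_NOTSUPPORTED"
else
    echo "$value"
fi
```

Without the `timeout` wrapper, the same check would hold a poller for the full five seconds (or forever, for a truly hung script).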
Diagnostics and Observability
Check Poller Queue in Zabbix Frontend
Navigate to Administration → Queue. Items delayed by more than 5 seconds appear in the queue and are considered problematic. Sort by delay to identify patterns by host or item type.
Analyze Poller Performance
Monitor Zabbix's internal self-monitoring items (key syntax: zabbix[process,<type>,<mode>,<state>]):
zabbix[process,poller,avg,busy]
zabbix[process,unreachable poller,avg,busy]
zabbix[queue]
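To alert on these items, a trigger expression along these lines can be used (the host name "Zabbix server" is an assumption; adjust the threshold and window to your environment):

```
avg(/Zabbix server/zabbix[process,poller,avg,busy],10m)>75
```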
Review zabbix_server.log
Look for entries like:
unreachable poller #xx [got empty string from ...]
poller #xx [cannot obtain data]
This helps isolate which pollers are hanging or skipping checks.
Common Pitfalls in Scaling Zabbix
1. Default Poller Limits
The default configuration provides only 5 pollers. This is insufficient for enterprise-scale setups monitoring thousands of items every minute.
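As a rough sizing sketch (the throughput and latency figures below are made-up examples, not measurements): the number of busy pollers approximates new values per second multiplied by average check duration, which gives a starting point like this:

```shell
# Back-of-the-envelope poller sizing:
#   required_pollers ~= values_per_second * average_check_duration
nvps=500           # passive checks per second (example figure)
latency=0.05       # average seconds per check (example figure)

# Round to the nearest whole poller.
required=$(awk -v n="$nvps" -v l="$latency" 'BEGIN { printf "%.0f", n * l }')
echo "Estimated StartPollers: $required (plus headroom)"
```

Add generous headroom on top of the estimate, since check latency spikes when devices misbehave.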
2. No Timeout on Custom Items
Scripts or user parameters without timeout controls may block indefinitely. Always wrap external calls with the timeout(1) command or a similar mechanism.
3. Large Number of Low-Interval Items
Items with 5s or 10s update intervals add significant load. Prioritize critical metrics for short intervals and move others to 60s+ or use traps.
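To see why intervals matter, here is a quick sketch of the values-per-second contribution of an item group (the 2000-item count is an arbitrary example):

```shell
# values/sec contributed by a group of items = item_count / interval.
items=2000
at_10s=$(awk -v i="$items" 'BEGIN { printf "%.1f", i / 10 }')
at_60s=$(awk -v i="$items" 'BEGIN { printf "%.1f", i / 60 }')
echo "10s interval: $at_10s values/sec"
echo "60s interval: $at_60s values/sec"
```

Moving the same items from a 10-second to a 60-second interval cuts their polling load six-fold.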
Step-by-Step Fixes
Step 1: Increase Poller and Related Processes
StartPollers=20
StartPollersUnreachable=10
StartSNMPPollers=15
StartPingers=10
Update these in zabbix_server.conf or zabbix_proxy.conf, then restart the service. (StartSNMPPollers applies to Zabbix 7.0 and later, which introduced dedicated asynchronous SNMP pollers; on older versions, SNMP checks share the regular pollers.)
Step 2: Optimize SNMP Timeouts
In host configuration or global SNMP settings:
Timeout=2
Retries=1
Ensure slow devices don't block pollers too long.
Step 3: Tune Item Update Intervals
Group items by importance:
- Critical: 10s–30s
- Standard: 60s
- Non-essential: 300s or passive/trap-only
Step 4: Offload to Proxies
Deploy proxies in network segments with high item counts or latency. This distributes load and reduces server-side processing time.
Step 5: Refactor Custom Scripts
External scripts should:
- Return values in <1s where possible
- Include timeout and error handling
- Avoid excessive I/O or forks
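Putting those rules together, a minimal hardened external check might look like this sketch (the target directory and the 3-second limit are placeholders, and `du` stands in for whatever slow work the real script does):

```shell
# Pattern: bound the slow call with a hard timeout and emit an explicit
# error value rather than hanging a poller.
target=/tmp   # hypothetical object to measure

if out=$(timeout 3 du -s "$target" 2>/dev/null | awk '{ print $1 }') \
   && [ -n "$out" ]; then
    result=$out                   # plain numeric value for the item
else
    result=ZBX_NOTSUPPORTED       # Zabbix marks the item unsupported
fi
echo "$result"
```

The key design choice is that every code path prints something quickly: either a value or an explicit unsupported marker, never silence.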
Best Practices for Long-Term Monitoring Health
- Regularly audit item update intervals and disable unused items
- Split monitoring into proxies for large infrastructures
- Enable preprocessing to offload expression calculations
- Use LLD (Low-Level Discovery) with care; avoid generating too many items
- Leverage traps for infrequent state-based metrics (e.g., service up/down)
Conclusion
High-latency polling in Zabbix often stems from under-provisioned pollers, inefficient item configurations, or overloaded proxies. The solution lies in thoughtful architectural distribution, polling optimization, and aggressive script hardening. By tuning pollers, load balancing with proxies, and refactoring custom checks, you can restore timely item updates and maintain high observability across distributed systems.
FAQs
1. How can I tell if my pollers are overloaded?
Check internal items such as zabbix[process,poller,avg,busy]. Values consistently above 75% busy indicate overloaded pollers. Also monitor the item queue under Administration → Queue.
2. Can Zabbix proxies reduce polling delays?
Yes. Proxies allow load to be distributed geographically or by environment. This reduces traffic and polling overhead on the main Zabbix server.
3. What is a good number of pollers for 10,000 items?
Depends on item types and intervals, but typically 30–50 pollers (plus SNMP and unreachable pollers) are required for responsive updates at this scale.
4. Why are SNMP items especially slow?
SNMP devices often have slow response times or timeouts. Optimize with lower retry counts and timeouts, and isolate them on separate proxies or SNMP-specific pollers.
5. Is preprocessing a solution for polling lag?
Preprocessing helps reduce trigger and calculation load by transforming values at collection time. While it doesn't solve raw polling latency, it improves overall server performance and trigger evaluation speed.