PagerDuty Architecture Overview
Core Components
- Services: Represent monitored applications or infrastructure
- Escalation Policies: Define who is notified and when
- Schedules: Control when each responder is on call
- Event Rules: Route alerts based on payload conditions
- Integrations: Connect monitoring and observability tools like Datadog, Prometheus, or Splunk
Common Complexity Points
In enterprise environments with hundreds of services and responders, misalignments in event routing logic, alert deduplication, or integration event formatting can cause major operational gaps.
Key Troubleshooting Scenarios
1. Alert Storms and False Positives
One of the most debilitating issues is receiving dozens of alerts for the same underlying incident, often caused by misconfigured monitoring thresholds or missing deduplication keys. The Events API v2 payload below sets an explicit dedup_key so repeated triggers collapse into a single incident:
{ "routing_key": "YOUR-INTEGRATION-KEY", "event_action": "trigger", "dedup_key": "database-down", "payload": { "summary": "DB unreachable", "source": "db01.prod", "severity": "critical" } }
Ensure deduplication keys are set and consistently used by upstream tools. Tune thresholds in the monitoring source to reduce alert noise.
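As a rough sketch of the upstream side, the Python snippet below uses the requests library against the Events API v2 enqueue endpoint; the integration key, hostname, and check name are placeholders. The point is deriving dedup_key from stable identifiers rather than free-form message text:

```python
# Rough sketch: send a deduplicated trigger event to the Events API v2.
# YOUR-INTEGRATION-KEY, the hostname, and the check name are placeholders.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(routing_key: str, source: str, check: str, summary: str) -> str:
    # Derive the dedup_key from stable identifiers (host + check), not from
    # timestamps or message text, so repeated triggers collapse into one incident.
    dedup_key = f"{source}:{check}"
    resp = requests.post(
        EVENTS_URL,
        json={
            "routing_key": routing_key,
            "event_action": "trigger",
            "dedup_key": dedup_key,
            "payload": {"summary": summary, "source": source, "severity": "critical"},
        },
        timeout=10,
    )
    resp.raise_for_status()          # the Events API answers 202 on success
    return resp.json()["dedup_key"]  # echoes the key PagerDuty stored

if __name__ == "__main__":
    trigger_alert("YOUR-INTEGRATION-KEY", "db01.prod", "db-connectivity", "DB unreachable")
```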
2. Escalation Policies Not Executing
Escalation steps may not trigger if schedule layers overlap incorrectly or on-call rotations are misconfigured. Verify that schedule coverage lines up with the escalation policy's timing.
Check: Schedules → Preview Schedule → Verify expected users for timeframe
Use the audit log and timeline view on incidents to trace missed escalations.
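If you prefer to script this check, the sketch below (assuming a REST API token and a placeholder schedule ID) asks the /oncalls endpoint who PagerDuty believes covers the next 24 hours; a gap in that window usually explains a step that notified no one:

```python
# Sketch: confirm who PagerDuty thinks is on call for a schedule over the
# next 24 hours. API_TOKEN and SCHEDULE_ID are placeholders.
from datetime import datetime, timedelta, timezone
import requests

API_TOKEN = "YOUR-REST-API-TOKEN"
SCHEDULE_ID = "PXXXXXX"

now = datetime.now(timezone.utc)
resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={"Authorization": f"Token token={API_TOKEN}"},
    params={
        "schedule_ids[]": SCHEDULE_ID,
        "since": now.isoformat(),
        "until": (now + timedelta(hours=24)).isoformat(),
    },
    timeout=10,
)
resp.raise_for_status()
for oncall in resp.json()["oncalls"]:
    # An empty result, or entries that don't cover the escalation window,
    # usually explains an escalation step that notified no one.
    print(oncall["user"]["summary"], oncall["start"], oncall["end"])
```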
3. Delayed or Missing Alerts
This typically stems from throttling or dropped events in the integration pipeline. PagerDuty enforces API rate limits (default: 7500 events per minute).
Check integration logs for HTTP 429 status codes. Implement backoff and retry mechanisms in your event sources.
Split large alert loads across multiple services if needed.
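A minimal retry-with-backoff sketch, assuming Python and the requests library; the attempt count and delays are illustrative rather than prescribed values:

```python
# Sketch: retry Events API sends with exponential backoff on HTTP 429.
# The attempt count and delays are illustrative.
import time
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def send_with_backoff(event: dict, max_attempts: int = 5) -> dict:
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.post(EVENTS_URL, json=event, timeout=10)
        if resp.status_code == 429:
            # Honor Retry-After if the response includes it; otherwise back off exponentially.
            time.sleep(float(resp.headers.get("Retry-After", delay)))
            delay *= 2
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"event still rate limited after {max_attempts} attempts")
```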
4. Broken Monitoring Tool Integrations
Incompatible payload formats or deprecated webhook versions can silently fail alert delivery.
Use the Events API to validate payloads before deploying changes. Set up logging for all outbound alert payloads from tools like Prometheus or New Relic.
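One lightweight way to catch format drift, sketched below, is to assert the required Events API v2 trigger fields before an event leaves your tooling; the validate_event helper is a hypothetical check and the example alert is illustrative:

```python
# Sketch: check outbound events for the required Events API v2 trigger fields
# before rolling out a monitoring-tool change. The example alert is hypothetical.
REQUIRED_TOP_LEVEL = ("routing_key", "event_action")
REQUIRED_PAYLOAD = ("summary", "source", "severity")
VALID_SEVERITIES = {"critical", "error", "warning", "info"}

def validate_event(event: dict) -> list[str]:
    problems = [f"missing top-level field: {f}" for f in REQUIRED_TOP_LEVEL if not event.get(f)]
    payload = event.get("payload", {})
    problems += [f"missing payload field: {f}" for f in REQUIRED_PAYLOAD if not payload.get(f)]
    if payload.get("severity") and payload["severity"] not in VALID_SEVERITIES:
        problems.append(f"invalid severity: {payload['severity']!r}")
    return problems

issues = validate_event({
    "routing_key": "YOUR-INTEGRATION-KEY",
    "event_action": "trigger",
    "payload": {"summary": "DB unreachable", "source": "db01.prod", "severity": "critical"},
})
print(issues or "payload looks valid")
```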
5. Responders Not Receiving Notifications
Notifications may fail due to muted contact methods, incorrect time zone settings, or unverified user profiles.
Admin → Users → Select User → Contact Methods → Ensure phone/SMS/email is verified
Use notification audit logs to confirm delivery attempts and errors.
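For a scripted check, the sketch below lists a user's contact methods through the REST API; the token and user ID are placeholders:

```python
# Sketch: list a user's contact methods to confirm phone/SMS/email entries exist.
# API_TOKEN and USER_ID are placeholders.
import requests

API_TOKEN = "YOUR-REST-API-TOKEN"
USER_ID = "PXXXXXX"

resp = requests.get(
    f"https://api.pagerduty.com/users/{USER_ID}/contact_methods",
    headers={"Authorization": f"Token token={API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
for method in resp.json()["contact_methods"]:
    # "type" distinguishes email, phone, SMS, and push contact methods
    print(method["type"], method.get("address"))
```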
Diagnostic Tools and Techniques
- Incident Timeline: View all event triggers and escalations in sequence
- Audit Logs: Track configuration and schedule changes
- Live Call Routing: Test escalation flow in real-time
- Event Rules Debugger: Evaluate how events are processed and routed
Best Practices
- Normalize event payloads across monitoring tools (see the sketch after this list)
- Use consistent dedup_keys and meaningful incident summaries
- Group alerts using Event Rules to reduce noise
- Run quarterly fire drills to test escalation flows
- Define clear ownership in each service's responder list
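As a sketch of the first two practices, the helper below maps alerts from different tools onto one Events API v2 shape with a stable dedup_key and a readable summary; the source-alert field names (host, check, message) are hypothetical:

```python
# Sketch: map alerts from different monitoring tools onto one Events API v2
# shape. The source-alert fields (host, check, message) are hypothetical.
def normalize(tool: str, alert: dict, routing_key: str) -> dict:
    source = alert.get("host", "unknown-host")
    check = alert.get("check", "unknown-check")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": f"{source}:{check}",  # stable across repeated firings
        "payload": {
            "summary": f"[{tool}] {check} failing on {source}: {alert.get('message', '')}",
            "source": source,
            "severity": alert.get("severity", "error"),
            "custom_details": alert,       # keep the raw alert for responders
        },
    }
```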
Conclusion
PagerDuty is a powerful platform, but its effectiveness depends on precise configuration and maintenance. Enterprise-scale deployments must be continuously audited and tuned to avoid operational overhead, alert fatigue, or missed SLAs. By diagnosing common pain points—from alert duplication to escalation flow gaps—and implementing structured remediation, teams can ensure resilient, efficient incident management pipelines.
FAQs
1. How can I reduce alert noise in PagerDuty?
Use event rules to suppress non-critical alerts, implement deduplication keys, and normalize thresholds in upstream monitoring systems.
2. Why didn't my escalation policy trigger the next responder?
Check for schedule overlaps and time zone misalignment, and confirm the user is actually within their on-call window during the escalation step.
3. How do I know if PagerDuty is dropping alerts?
Monitor integration logs for API rate limit errors and validate payload delivery using the Events API or webhook logging tools.
4. Can I test incident routing without triggering real alerts?
Yes. Use the "Create Test Incident" feature or the REST API with non-critical severities and test services to simulate routing flows.
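A sketch of the REST API route, assuming a dedicated test service and placeholder credentials (the From header must be a real user email on the account):

```python
# Sketch: open an incident on a dedicated test service via the REST API so
# routing can be exercised without paging production responders.
# The token, From address, and service ID are placeholders.
import requests

resp = requests.post(
    "https://api.pagerduty.com/incidents",
    headers={
        "Authorization": "Token token=YOUR-REST-API-TOKEN",
        "From": "ops-admin@example.com",  # must be a valid user email on the account
        "Content-Type": "application/json",
    },
    json={
        "incident": {
            "type": "incident",
            "title": "Routing drill - safe to ignore",
            "service": {"id": "PTESTSVC", "type": "service_reference"},
        }
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["incident"]["id"], resp.json()["incident"]["status"])
```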
5. How do I handle alert spikes during outages?
Enable alert grouping, use deduplication keys, and configure rate limits or buffering mechanisms in upstream systems to throttle alert volume.
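One way to buffer upstream, sketched below, is to coalesce a burst by dedup_key inside a short window before forwarding, so an outage spike becomes a handful of events; AlertBuffer is a hypothetical helper and the 30-second window is arbitrary:

```python
# Sketch: coalesce a burst of alerts by dedup_key inside a short window
# before forwarding. AlertBuffer is a hypothetical upstream helper; the
# 30-second window is arbitrary.
import time
from collections import OrderedDict

class AlertBuffer:
    def __init__(self, window_seconds: float = 30.0):
        self.window = window_seconds
        self._pending: OrderedDict[str, dict] = OrderedDict()
        self._opened = time.monotonic()

    def add(self, event: dict) -> None:
        # Later events with the same dedup_key replace earlier ones, so a
        # flapping check contributes one event per window instead of dozens.
        self._pending[event["dedup_key"]] = event

    def flush_if_due(self) -> list[dict]:
        if time.monotonic() - self._opened < self.window:
            return []
        events = list(self._pending.values())
        self._pending.clear()
        self._opened = time.monotonic()
        return events  # hand these to the retry sender shown earlier
```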