In this article, we will analyze the causes of high cardinality in Prometheus, explore debugging techniques, and provide best practices to optimize Prometheus for large-scale monitoring.
Understanding High Cardinality and Memory Issues in Prometheus
High cardinality in Prometheus occurs when excessive unique label values generate an unmanageable number of time series, leading to memory exhaustion. Common causes include:
- Excessive use of dynamic labels such as request IDs, user IDs, and timestamps.
- Poorly designed queries with inefficient regex matching.
- Incorrect storage retention policies causing excessive data retention.
- Scraping high-frequency metrics without downsampling.
- Misconfigured remote storage leading to query latency.
Common Symptoms
- High Prometheus memory usage leading to frequent OOM kills.
- Slow query response times, especially for wide time ranges.
- Excessive number of active time series in prometheus_tsdb.
- Unresponsive Grafana dashboards due to inefficient queries.
- Storage exhaustion caused by long-term retention policies.
Diagnosing High Cardinality and Memory Issues in Prometheus
1. Checking the Number of Active Time Series
Monitor time series growth:
curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.headStats.numSeries
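The same head-series count is also exposed as a metric on Prometheus itself, so you can graph its growth over time (assuming Prometheus scrapes its own /metrics endpoint, as the default configuration does):

# current number of series in the TSDB head
prometheus_tsdb_head_series

# per-second rate of newly created series, a useful early-warning signal
rate(prometheus_tsdb_head_series_created_total[5m])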
2. Identifying High-Cardinality Labels
List all metric names (the values of the __name__ label) to spot unexpectedly large metric families:
curl -s http://localhost:9090/api/v1/label/__name__/values
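The TSDB status endpoint from step 1 also reports per-metric and per-label cardinality statistics, which usually point straight at the offending label. A minimal sketch using jq:

# metrics with the most series
curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.seriesCountByMetricName

# labels with the most distinct values
curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.labelValueCountByLabelName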
3. Profiling Memory Usage
Check Prometheus memory footprint:
top -p $(pgrep -d , prometheus)
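Prometheus also reports its own memory usage as metrics, which is easier to track over time than top; a quick sketch, assuming the self-scrape job is named prometheus:

# resident memory of the Prometheus process
process_resident_memory_bytes{job="prometheus"}

# Go heap currently in use
go_memstats_heap_inuse_bytes{job="prometheus"}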
4. Evaluating Query Performance
Inspect Prometheus's own query-engine metrics, for example the 90th-percentile time spent in each query stage:
prometheus_engine_query_duration_seconds{quantile="0.9"}
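To identify the individual expressions that are slow, you can also enable the query log (available since Prometheus 2.16); a minimal sketch, assuming /var/log/prometheus/queries.json is a writable path:

global:
  query_log_file: /var/log/prometheus/queries.json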
5. Analyzing Storage Utilization
Check database metrics to find inefficiencies:
curl -s http://localhost:9090/api/v1/query --data-urlencode "query=prometheus_tsdb_storage_blocks_bytes"
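Comparing that figure with on-disk usage helps spot WAL bloat or retention misconfiguration; a quick check, assuming the data directory is /var/lib/prometheus:

du -sh /var/lib/prometheus/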
Fixing High Cardinality and Memory Issues in Prometheus
Solution 1: Reducing High-Cardinality Labels
Limit dynamic labels and attach only a small set of static, low-cardinality labels at scrape time:
- job_name: "app"
  static_configs:
    - targets: ["localhost:9090"]
      labels:
        instance: "web-server-1"
        environment: "prod"
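If an exporter already emits a high-cardinality label that you cannot change at the source, you can strip it at ingestion time with metric_relabel_configs; a minimal sketch, assuming the offending label is called request_id:

- job_name: "app"
  static_configs:
    - targets: ["localhost:9090"]
  metric_relabel_configs:
    - action: labeldrop
      regex: "request_id"

Note that dropping a label collapses any series that differed only by that label, so make sure the remaining label set still identifies each series uniquely.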
Solution 2: Optimizing Query Performance
Use label filtering instead of regex-heavy queries:
rate(http_requests_total{method="GET"}[5m])
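For comparison, the regex-heavy form of the same query forces Prometheus to match far more series; preferring exact matchers and aggregating away labels you do not need keeps the working set small (the method and status labels are illustrative):

# avoid: regex matcher that touches many series
rate(http_requests_total{method=~"GET|POST|PUT"}[5m])

# prefer: exact matcher, aggregated over only the labels you need
sum by (status) (rate(http_requests_total{method="GET"}[5m]))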
Solution 3: Implementing Downsampling
Use recording rules to precompute expensive aggregations, so dashboards read a handful of precomputed series instead of recomputing over many raw ones:
- record: http_requests:rate5m
  expr: rate(http_requests_total[5m])
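In a complete setup the rule lives in a rule-group file that prometheus.yml references via rule_files; a sketch, assuming the rule file is saved as /etc/prometheus/rules/http.yml:

# /etc/prometheus/rules/http.yml
groups:
  - name: http_requests
    interval: 1m
    rules:
      - record: http_requests:rate5m
        expr: rate(http_requests_total[5m])

# in prometheus.yml
rule_files:
  - /etc/prometheus/rules/*.yml

Dashboards can then query http_requests:rate5m directly instead of recomputing the rate over raw series.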
Solution 4: Adjusting Storage Retention Policies
Configure retention settings to prevent excessive data accumulation:
--storage.tsdb.retention.time=15d
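Retention flags are passed on the Prometheus command line; you can also cap total disk usage with --storage.tsdb.retention.size (the paths and the 50GB limit below are illustrative):

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB

When both limits are set, whichever is reached first triggers deletion of the oldest blocks.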
Solution 5: Offloading Long-Term Storage
Use remote storage for historical data:
remote_write:
  - url: "http://remote-storage:9201/write"
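remote_write can also be scoped so that only the series worth keeping long-term leave Prometheus; a sketch using write_relabel_configs, where the metric-name pattern is an assumption:

remote_write:
  - url: "http://remote-storage:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "http_requests:.*|up"
        action: keep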
Best Practices for Scalable Prometheus Monitoring
- Avoid using high-cardinality labels like request IDs or timestamps.
- Optimize PromQL queries to prevent unnecessary data processing.
- Use downsampling and recording rules to reduce query load.
- Adjust retention policies to balance data retention and storage needs.
- Offload historical data to remote storage to keep Prometheus fast.
Conclusion
High cardinality and excessive memory usage can severely impact Prometheus performance. By optimizing metric labeling, query execution, and storage retention, developers can ensure an efficient and scalable monitoring setup.
FAQ
1. Why is my Prometheus instance using too much memory?
Common causes include excessive time series, high-cardinality labels, and inefficient query execution.
2. How can I reduce high-cardinality labels in Prometheus?
Avoid including dynamic values like request IDs or timestamps in labels.
3. What is the best way to improve slow PromQL queries?
Use exact label matchers instead of regex-heavy selectors, and move expensive expressions into recording rules.
4. How do I set up long-term storage for Prometheus?
Configure remote_write to send historical data to external storage.
5. What is the ideal retention policy for Prometheus?
For most applications, a 15-30 day retention period balances storage and performance.