In this article, we will analyze the causes of high cardinality in Prometheus, explore debugging techniques, and provide best practices to optimize performance while maintaining visibility into key metrics.
Understanding High Cardinality in Prometheus
High cardinality occurs when a metric has too many unique label combinations (time series). This leads to:
- Increased memory consumption, making Prometheus slow or unresponsive.
- Slow queries due to excessive data stored in TSDB (Time Series Database).
- OOM (Out of Memory) crashes when handling large numbers of time series.
- Longer scrape and storage durations, affecting monitoring reliability.
Common Symptoms
- Prometheus taking too long to respond to queries.
- Frequent out of memory or OOMKilled events.
- Increased CPU usage during scrapes.
- Slow Grafana dashboards due to inefficient queries.
Diagnosing High Cardinality Issues
1. Identifying High Cardinality Metrics
List the number of time series per metric name using PromQL:
count by (__name__) ({__name__=~".+"})
This identifies metrics with a large number of time series.
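To focus on the worst offenders, you can wrap the count in topk; the cutoff of 10 below is arbitrary:
topk(10, count by (__name__) ({__name__=~".+"}))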
2. Checking Storage Usage
Check the total number of active time series in the TSDB head:
prometheus_tsdb_head_series
This helps track excessive time series growth.
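To see how quickly new series are being created, you can also rate Prometheus's own churn counter (this sketch assumes Prometheus scrapes itself, which is the default setup):
rate(prometheus_tsdb_head_series_created_total[5m])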
3. Monitoring Label Explosion
Find which jobs and targets contribute the most time series:
count by (job, instance) ({__name__=~".+"})
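To measure how many distinct values a single label contributes, a nested count works; this sketch assumes a hypothetical http_requests_total metric with a path label:
count(count by (path) (http_requests_total))
A result in the thousands usually points to an unbounded label such as a user ID or full URL path.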
4. Profiling Query Execution Time
Identify slow queries using Prometheus's query-engine timing metrics:
prometheus_engine_query_duration_seconds
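For deeper profiling, Prometheus can also log every executed query to a file via the global query_log_file option; the path below is only an example:
global:
  query_log_file: /var/log/prometheus/query.log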
Fixing High Cardinality in Prometheus
Solution 1: Reducing Label Combinations
Remove unnecessary labels from metrics:
metric_name{label1="value1", label2="value2"} -> metric_name{label1="value1"}
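In practice, labels are usually stripped at scrape time with metric_relabel_configs. The sketch below assumes label2 is the high-cardinality label and uses a placeholder job and target:
scrape_configs:
  - job_name: "example"
    static_configs:
      - targets: ["localhost:8080"]
    metric_relabel_configs:
      - action: labeldrop
        regex: "label2"
Be careful: if two series differed only by the dropped label, they will collide after relabeling, so verify that the remaining label set still uniquely identifies each series.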
Solution 2: Using Histogram Buckets Wisely
Each histogram bucket is its own time series, so keep the bucket set small and compute quantiles at query time:
histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
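If an exporter already exposes more buckets than you need, the extra bucket series can be dropped at scrape time. This is a sketch that assumes the http_request_duration_seconds histogram above and an arbitrary set of boundaries to discard; keep the +Inf bucket and a sensible spread so histogram_quantile still produces useful estimates:
metric_relabel_configs:
  - source_labels: [__name__, le]
    regex: "http_request_duration_seconds_bucket;(0.001|0.005|0.075|7.5)"
    action: drop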
Solution 3: Dropping Unnecessary Metrics
Use metric relabeling to drop unneeded metrics at scrape time:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "high_cardinality_metric.*"
    action: drop
Solution 4: Enabling Remote Storage for Scalability
Move long-term data to external storage:
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
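remote_write also supports write_relabel_configs, which lets you forward only the metrics you need to keep long term; the app_.* prefix below is just a placeholder:
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "app_.*"
        action: keep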
Solution 5: Optimizing PromQL Queries
Use aggregation functions to reduce time series count:
sum(rate(http_requests_total[5m])) by (job)
Best Practices for Managing High Cardinality
- Regularly audit metrics using count by (__name__).
- Use minimal labels to reduce unique time series.
- Apply relabeling to drop unnecessary metrics.
- Use remote storage for long-term retention.
- Optimize PromQL queries for efficiency.
Conclusion
High cardinality in Prometheus can degrade performance and cause resource exhaustion. By limiting label combinations, optimizing queries, and leveraging remote storage, DevOps teams can ensure a scalable and efficient monitoring setup.
FAQ
1. Why is my Prometheus using excessive memory?
High cardinality metrics create too many time series, leading to increased memory consumption.
2. How do I identify high cardinality metrics?
Use count by (__name__) to list the metrics with the most time series.
3. Can I reduce storage without losing important data?
Yes, use relabeling to drop unnecessary metrics and move long-term data to remote storage.
4. What is the best way to optimize PromQL queries?
Use aggregation functions like sum() and rate() to minimize the data processed per query.
5. How do I prevent Prometheus from crashing due to memory overload?
Limit the number of time series by reducing labels, optimizing queries, and using external storage.