In this article, we will analyze the causes of high cardinality in Prometheus, explore debugging techniques, and provide best practices to optimize performance while maintaining visibility into key metrics.

Understanding High Cardinality in Prometheus

High cardinality occurs when a metric has a very large number of unique label combinations, each of which Prometheus stores as a separate time series. This leads to:

  • Increased memory consumption, making Prometheus slow or unresponsive.
  • Slow queries due to excessive data stored in TSDB (Time Series Database).
  • OOM (Out of Memory) crashes when handling large numbers of time series.
  • Longer scrape and storage durations, affecting monitoring reliability.

Common Symptoms

  • Prometheus taking too long to respond to queries.
  • Frequent out of memory or OOMKilled events.
  • Increased CPU usage during scrapes.
  • Slow Grafana dashboards due to inefficient queries.

Diagnosing High Cardinality Issues

1. Identifying High Cardinality Metrics

Count how many time series each metric name exposes using PromQL:

count by (__name__) ({__name__=~".+"})

This identifies metrics with a large number of time series.
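
To surface the worst offenders directly, wrap the same count in topk; the limit of 10 here is just an example:

topk(10, count by (__name__) ({__name__=~".+"}))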

2. Checking Active Series Count

Check the total number of time series currently held in memory (the TSDB head block):

prometheus_tsdb_head_series

This helps track excessive time series growth.
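
If series growth is a recurring problem, this gauge can also drive an alert. A minimal sketch of an alerting rule; the one-million threshold is an arbitrary example and should be tuned to your instance:

groups:
  - name: cardinality
    rules:
      - alert: TooManyTimeSeries
        expr: prometheus_tsdb_head_series > 1000000   # example threshold, tune per instance
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is tracking an unusually high number of time series"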

3. Monitoring Label Explosion

Find which scrape jobs and instances contribute the most time series:

count by (job, instance) ({__name__=~".+"})
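
To see how many distinct values a single label contributes, count the grouped result again; http_requests_total and path are placeholder names for your own metric and label:

count(count by (path) (http_requests_total))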

4. Profiling Query Execution Time

Inspect Prometheus' own query engine timing metrics to spot expensive query workloads:

prometheus_engine_query_duration_seconds
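
For per-query detail, Prometheus can also write a query log. A minimal sketch of the global configuration, with an example file path:

global:
  query_log_file: /var/log/prometheus/query.log   # example path; each executed query is logged with timing information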

Fixing High Cardinality in Prometheus

Solution 1: Reducing Label Combinations

Remove unnecessary labels from metrics:

metric_name{label1="value1", label2="value2"} -> metric_name{label1="value1"}
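
If the instrumentation itself cannot be changed, the extra label can be stripped at scrape time instead. A sketch using metric_relabel_configs; the job name, target, and label2 are placeholders:

scrape_configs:
  - job_name: "example-app"
    static_configs:
      - targets: ["localhost:8080"]
    metric_relabel_configs:
      - regex: "label2"        # drop this label from every scraped sample
        action: labeldrop

Note that dropping a label only helps if the remaining labels still uniquely identify each series; otherwise the collapsed series will collide.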

Solution 2: Using Histogram Buckets Wisely

Each histogram bucket adds one time series per label set, so define only the buckets you actually need. Quantile queries such as the one below still work with a coarser bucket layout, just with less precision:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
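
When the histogram comes from an application you cannot modify, overly fine buckets can be dropped at scrape time. A sketch, assuming the metric above and purely illustrative le values:

metric_relabel_configs:
  - source_labels: [__name__, le]
    regex: 'http_request_duration_seconds_bucket;(0\.005|0\.01)'   # drop the two finest buckets
    action: drop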

Solution 3: Dropping Unnecessary Metrics

Use metric relabeling to filter out unneeded metrics at scrape time (metric_relabel_configs runs after each scrape, when metric names are available):

metric_relabel_configs:
  - source_labels: [__name__]
    regex: "high_cardinality_metric.*"
    action: drop
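
The inverse approach also works: keep only an explicit allowlist of metrics and drop everything else. A sketch with placeholder metric names:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: "(http_requests_total|http_request_duration_seconds.*)"   # placeholder allowlist
    action: keep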

Solution 4: Enabling Remote Storage for Scalability

Move long-term data to external storage:

remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"

Solution 5: Optimizing PromQL Queries

Use aggregation functions to reduce time series count:

sum(rate(http_requests_total[5m])) by (job)
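
If this aggregation backs several dashboards, it can be precomputed with a recording rule so queries read a handful of series instead of every raw one. A sketch of a rule file; the group and record names are examples:

groups:
  - name: http_aggregations
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)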

Best Practices for Managing High Cardinality

  • Regularly audit metrics using count by (__name__).
  • Use minimal labels to reduce unique time series.
  • Apply relabeling to drop unnecessary metrics.
  • Use remote storage for long-term retention.
  • Optimize PromQL queries for efficiency.

Conclusion

High cardinality in Prometheus can degrade performance and cause resource exhaustion. By limiting label combinations, optimizing queries, and leveraging remote storage, DevOps teams can ensure a scalable and efficient monitoring setup.

FAQ

1. Why is my Prometheus using excessive memory?

High cardinality metrics create too many time series, leading to increased memory consumption.

2. How do I identify high cardinality metrics?

Use count by (__name__) to list metrics with the most time series.

3. Can I reduce storage without losing important data?

Yes, use relabeling to drop unnecessary metrics and move long-term data to remote storage.

4. What is the best way to optimize PromQL queries?

Combine rate() with aggregation operators such as sum() by (...) so that each query returns far fewer series.

5. How do I prevent Prometheus from crashing due to memory overload?

Limit the number of time series by reducing labels, optimizing queries, and using external storage.