1. Data Ingestion Failures

Understanding the Issue

Druid fails to ingest data, leading to incomplete datasets and failed indexing tasks.

Root Causes

  • Malformed input data or schema mismatches.
  • Insufficient memory allocated for ingestion tasks.
  • Improper configuration of data sources.

Fix

Check ingestion task logs for errors. Task logs are written to the location set by the indexer log properties (the directory shown here is the quickstart default):

druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs
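
Logs for an individual task can also be fetched through the Overlord API; a minimal example, assuming the Overlord listens on druid-overlord-host:8090 and <taskId> is the failing task's ID:

curl "http://druid-overlord-host:8090/druid/indexer/v1/task/<taskId>/log"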

Validate input data format:

cat sample-data.json | jq .
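
If the file is newline-delimited JSON (the usual format for Druid's json inputFormat), a quick per-record check, assuming sample-data.json is the file being ingested:

jq -c . sample-data.json > /dev/null && echo "all records parse"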

Increase the processing buffer passed to ingestion (peon) tasks; this example raises it to 512 MB:

druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
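
If the tasks run out of heap rather than buffer space, the peon JVM options can be raised in the MiddleManager's runtime.properties; a sketch with illustrative sizes:

druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx2g","-XX:MaxDirectMemorySize=2g"]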

2. Slow Query Performance

Understanding the Issue

Druid queries take longer than expected, affecting real-time analytics performance.

Root Causes

  • High-cardinality dimensions that make grouping and filtering expensive.
  • Unoptimized segment granularity settings.
  • Insufficient memory or CPU allocation.

Fix

Use EXPLAIN PLAN to inspect how a slow SQL query will be executed:

EXPLAIN PLAN FOR SELECT * FROM my_table WHERE dimension='value';
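
The same statement can be sent to the SQL API; a minimal example, assuming the Broker listens on druid-broker-host:8082:

curl -X POST "http://druid-broker-host:8082/druid/v2/sql" \
  -H "Content-Type: application/json" \
  -d '{"query": "EXPLAIN PLAN FOR SELECT COUNT(*) FROM my_table"}'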

Adjust segment granularity for better performance. Segment granularity is set in the ingestion spec's granularitySpec rather than as a runtime property; coarser granularity such as DAY produces fewer, larger segments.
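
A minimal granularitySpec fragment with DAY segments (queryGranularity and rollup values are illustrative):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "HOUR",
  "rollup": true
}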

Increase the processing buffer on Historical nodes (direct memory used by each processing thread):

druid.processing.buffer.sizeBytes=1073741824
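
A larger buffer only helps if the JVM has enough direct memory to back it; the sizing rule from the Druid configuration docs is roughly:

MaxDirectMemorySize >= druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)

For example, a 1 GB buffer with 7 processing threads and 2 merge buffers needs at least 10 GB of direct memory.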

3. Out-of-Memory (OOM) Errors

Understanding the Issue

Druid services crash due to memory exhaustion, affecting cluster stability.

Root Causes

  • Incorrect JVM heap size allocation.
  • Large ingestion jobs consuming excessive memory.
  • Improper caching settings causing memory spikes.

Fix

Adjust JVM heap settings. Each Druid service reads its heap size from its jvm.config file (for example conf/druid/cluster/data/historical/jvm.config):

-Xms4g
-Xmx8g
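
A fuller jvm.config sketch for a Historical, with illustrative sizes that must be matched to the host and to the druid.processing.* settings:

-server
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=6g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8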

Enable result-level caching on the Broker so repeated queries are served from cache instead of being recomputed:

druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true

Tune how many segments Historical nodes load in parallel; druid.segmentCache.numLoadingThreads bounds the memory used while segments are pulled and mapped:

druid.segmentCache.numLoadingThreads=4

4. Coordinator and Overlord Failures

Understanding the Issue

The Druid coordinator and overlord services stop responding, affecting cluster management.

Root Causes

  • High ingestion workloads overwhelming the services.
  • ZooKeeper connectivity issues.
  • Metadata database failures.

Fix

Restart coordinator and overlord services:

sudo systemctl restart druid-coordinator druid-overlord

Check ZooKeeper connectivity:

echo srvr | nc zookeeper-host 2181
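
Also confirm that Druid's own ZooKeeper connection string points at the same ensemble; a typical entry in common.runtime.properties (host name illustrative):

druid.zk.service.host=zookeeper-host:2181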

Ensure metadata storage database is running:

systemctl status postgresql
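
It is also worth checking that the metadata connector settings match the running database; a typical PostgreSQL configuration, with illustrative values:

druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db-host:5432/druid
druid.metadata.storage.connector.user=druid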

5. Indexing Task Failures

Understanding the Issue

Indexing tasks fail or get stuck, preventing data from being available for queries.

Root Causes

  • Incorrect task specifications in ingestion configurations.
  • Insufficient worker slots for task execution.
  • Segment version conflicts causing task failures.

Fix

List recent tasks and their states through the Overlord API:

curl -X GET "http://druid-overlord-host:8090/druid/indexer/v1/tasks"
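
The status of a specific task can then be fetched by ID (<taskId> here is a placeholder):

curl "http://druid-overlord-host:8090/druid/indexer/v1/task/<taskId>/status"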

Increase worker slots:

druid.worker.capacity=4

Resolve segment conflicts by re-running compaction, either by submitting a compaction task to the Overlord or by enabling auto-compaction for the datasource through the Coordinator (web console or compaction-config API); see the task sketch below.
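
A minimal compaction task sketch, assuming a datasource named my_table and an illustrative interval; it is submitted to the Overlord's /druid/indexer/v1/task endpoint:

{
  "type": "compact",
  "dataSource": "my_table",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2024-01-01/2024-02-01"
    }
  }
}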

Conclusion

Apache Druid is a high-speed analytics database, but troubleshooting ingestion failures, slow queries, out-of-memory errors, coordinator/overlord failures, and indexing task issues is essential for maintaining optimal performance. By fine-tuning memory allocation, optimizing segment granularity, and ensuring proper metadata storage, users can enhance the stability and efficiency of their Druid clusters.

FAQs

1. Why is my Druid ingestion task failing?

Check task logs for schema mismatches, validate input data, and increase memory allocation.

2. How do I optimize slow queries in Druid?

Adjust segment granularity, optimize query plans, and increase memory for historical nodes.

3. What causes out-of-memory errors in Druid?

Inadequate JVM heap size, inefficient caching, and large ingestion jobs can lead to memory exhaustion.

4. How do I fix coordinator and overlord failures?

Restart services, check ZooKeeper connectivity, and ensure metadata database availability.

5. How do I resolve indexing task errors?

Verify task specifications, increase worker slots, and re-run compaction tasks to resolve conflicts.