Understanding the Root of Query Inconsistency

How Druid Ingestion Works

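Druid supports real-time ingestion through tasks that run on MiddleManagers or Indexers and publish segments periodically. Until a segment is published and handed off, its data is held by the ingestion task (in memory and in intermediate persists on local disk) and is not available on historical nodes. To return complete results, the Broker fans each query out to both the real-time tasks and the historical nodes that hold data for the requested interval.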

Concurrency Problems

When ingestion and querying run concurrently, Brokers may miss real-time data because of:

  • Lag in segment handoff to historical nodes
  • Overloaded real-time nodes dropping events under memory pressure
  • Incorrect minTime/maxTime boundaries causing temporal filtering issues

Architectural Implications

Inconsistent query results jeopardize:

  • Data integrity in operational dashboards
  • Trust in near-real-time analytics
  • Complex alerting or anomaly detection workflows

This behavior may trigger executive-level escalations if financial, fraud, or customer activity metrics appear delayed or incorrect.

Diagnosing Inconsistency

Symptoms to Watch For

  • Recent records missing from aggregations
  • Flaky test results in CI/CD with query-based validations
  • Different answers for the same query within short time intervals
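The last symptom can be checked mechanically by issuing the same query several times and comparing the answers. The following is a minimal sketch, not a Druid client: `distinct_answers` and the `run_query` callable are hypothetical names, and the callable is assumed to wrap whatever query mechanism you already use and to return a hashable, comparable result such as a row count.

```python
from typing import Callable, Hashable, List


def distinct_answers(run_query: Callable[[], Hashable], attempts: int = 3) -> List[Hashable]:
    """Run the same query `attempts` times and return the distinct results in first-seen order."""
    seen: List[Hashable] = []
    for _ in range(attempts):
        result = run_query()
        if result not in seen:
            seen.append(result)
    return seen


# Hypothetical stand-in for a real query: the row count changes between
# calls, as it might while real-time data is still arriving.
fake_counts = iter([1000, 1000, 1042])
answers = distinct_answers(lambda: next(fake_counts), attempts=3)
is_consistent = len(answers) == 1  # False here, because two different counts were seen
```

More than one distinct answer for a closed time interval is a signal worth investigating; for an interval that is still being ingested, some drift is expected.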

Confirming via Query Logs

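Enable query logging and inspect which segments each query scanned. The entry below is illustrative; the exact fields depend on your Druid version and logging setup: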

{
  "queryTime": "2025-07-21T14:00:00Z",
  "segmentsScanned": ["segment_2025-07-21T13:00:00Z_...", "segment_realtime_2025..."]
}

If no real-time segment appears for an interval that is still being ingested, the query did not see the in-flight data.
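A log check like this is easy to automate. The sketch below assumes each log entry is a JSON object with a `segmentsScanned` array, as in the sample above, and that real-time segments can be recognized by the substring `realtime` in their identifier. Both are assumptions taken from that sample rather than a documented Druid log format, and the segment identifiers in the example are made up.

```python
import json


def has_realtime_segment(log_line: str, marker: str = "realtime") -> bool:
    """Return True if any scanned segment identifier contains `marker`."""
    entry = json.loads(log_line)
    return any(marker in segment for segment in entry.get("segmentsScanned", []))


# Hypothetical log lines in the shape of the sample entry.
with_realtime = (
    '{"queryTime": "2025-07-21T14:00:00Z", '
    '"segmentsScanned": ["segment_2025-07-21T13:00:00Z_example", "segment_realtime_example"]}'
)
historical_only = (
    '{"queryTime": "2025-07-21T14:00:00Z", '
    '"segmentsScanned": ["segment_2025-07-21T13:00:00Z_example"]}'
)

missing_realtime = not has_realtime_segment(historical_only)
```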

Step-by-Step Remediation

1. Enable Query Caching Only on Historical Nodes

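Caching results on real-time nodes can return stale or partial responses, because the underlying data is still changing. The settings below enable the Broker's cache and turn caching off for real-time tasks, so that only results for segments served by historical nodes are cached: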

# Broker runtime.properties
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true

# Real-time task (Peon/Indexer) runtime.properties
druid.realtime.cache.useCache=false

2. Monitor Segment Handoff Lag

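Track handoff and publishing delays with Druid's ingestion metrics, such as ingest/handoff/time and ingest/handoff/failed (exact metric names depend on the Druid version and emitter configuration). If they spike, increase the memory available to ingestion tasks or shorten the task duration so that segments are published more often.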
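What counts as a spike is a judgment call. One simple option, sketched below, is to flag samples that exceed a multiple of the series median. The function name `spike_indexes`, the factor of 3, and the sample values are all hypothetical, and the input is assumed to be handoff times in milliseconds that you have already pulled from your metrics store.

```python
from statistics import median
from typing import List, Sequence


def spike_indexes(samples: Sequence[float], factor: float = 3.0) -> List[int]:
    """Return the indexes of samples greater than `factor` times the series median."""
    if not samples:
        return []
    baseline = median(samples)
    if baseline <= 0:
        return []
    return [i for i, value in enumerate(samples) if value > factor * baseline]


# Hypothetical handoff times in milliseconds; the fourth sample is an outlier.
handoff_ms = [1200, 1100, 1300, 9000, 1250]
spikes = spike_indexes(handoff_ms)  # the median is 1250, so the threshold is 3750
```

A median baseline is less sensitive to the outliers themselves than a mean would be, which is why it is used here.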

3. Synchronize Query Offsets with Lag

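To avoid querying data that is still in flight, end the query interval slightly before the current time. For example, a query issued at 14:00 with a one-minute buffer would stop at 13:59: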

"interval": "2025-07-21T13:00:00Z/2025-07-21T13:59:00Z"

This keeps queries behind the leading edge of ingestion, as long as the buffer is larger than the typical ingestion lag.
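Computing the buffered interval in client code avoids hard-coding timestamps. The sketch below is one way to do it; `buffered_interval` is a hypothetical helper, and the one-minute buffer is only an example value that should be sized from your own observed lag.

```python
from datetime import datetime, timedelta, timezone


def buffered_interval(now: datetime, window: timedelta, buffer: timedelta) -> str:
    """Build an ISO 8601 'start/end' interval that ends `buffer` before `now`."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    end = now.astimezone(timezone.utc) - buffer
    start = end - window
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"


# A query issued at 14:00 UTC, covering 59 minutes and trailing by 1 minute.
issued_at = datetime(2025, 7, 21, 14, 0, tzinfo=timezone.utc)
interval = buffered_interval(issued_at, timedelta(minutes=59), timedelta(minutes=1))
# interval == "2025-07-21T13:00:00Z/2025-07-21T13:59:00Z"
```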

4. Use Kafka Ingestion Tuning

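In Kafka-based ingestion, tune maxRowsPerSegment and intermediatePersistPeriod so that data is persisted and segments are closed sooner. maxRowsPerSegment caps the number of rows in a segment before a new one is started, and intermediatePersistPeriod sets how often in-memory rows are flushed to disk. Publishing itself is governed by the supervisor's taskDuration, so shorten that as well if segments reach historical nodes too slowly. A fragment of the relevant configuration: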

{
  "type": "kafka",
  "tuningConfig": {
    "maxRowsPerSegment": 100000,
    "intermediatePersistPeriod": "PT10M"
  }
}
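To get a feel for how these values interact, it helps to estimate how long a segment takes to reach the row limit. The arithmetic below is a rough sketch under loud assumptions: a steady ingestion rate, no rollup (every event becomes one stored row), and all rows landing in a single segment. The rate of 500 rows per second and the function name are hypothetical.

```python
def seconds_to_fill_segment(max_rows_per_segment: int, rows_per_second: float) -> float:
    """Estimate the seconds until a segment reaches max_rows_per_segment at a steady rate."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return max_rows_per_segment / rows_per_second


# With maxRowsPerSegment = 100000 and an assumed 500 rows/s:
fill_seconds = seconds_to_fill_segment(100_000, 500)  # 200.0 seconds

# intermediatePersistPeriod PT10M is 600 seconds, so under these assumptions
# the row limit is reached before the first time-based persist.
persist_period_seconds = 10 * 60
row_limit_hits_first = fill_seconds < persist_period_seconds
```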

Best Practices for Stability

  • Stagger heavy queries and ingestion windows
  • Pin brokers to preferred historical nodes for stable routing
  • Use kill tasks to permanently delete segments that have already been marked unused
  • Set alarms on ingestion lag and handoff failures
  • Enforce query time boundaries in APIs to avoid open-ended time ranges
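The last practice, bounded query intervals, can be enforced at the API layer before a query ever reaches Druid. The sketch below assumes the API receives intervals as ISO 8601 'start/end' strings with explicit timestamps on both sides, both written in the same timezone style; `parse_bounded_interval` and the one-hour limit are hypothetical. Duration-style intervals such as 'start/P1D' are rejected by this sketch rather than supported.

```python
from datetime import datetime, timedelta
from typing import Tuple


def parse_bounded_interval(interval: str, max_span: timedelta) -> Tuple[datetime, datetime]:
    """Parse 'start/end' and reject open-ended, inverted, or overly long intervals."""
    start_text, separator, end_text = interval.partition("/")
    if not separator or not start_text or not end_text:
        raise ValueError("interval must be 'start/end' with both bounds present")
    # datetime.fromisoformat on older Python versions does not accept a trailing 'Z'.
    start = datetime.fromisoformat(start_text.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_text.replace("Z", "+00:00"))
    if end <= start:
        raise ValueError("interval end must be after its start")
    if end - start > max_span:
        raise ValueError("interval exceeds the maximum allowed span")
    return start, end


start, end = parse_bounded_interval(
    "2025-07-21T13:00:00Z/2025-07-21T13:59:00Z", max_span=timedelta(hours=1)
)
```

Rejecting the request outright, rather than silently clamping it, makes the limit visible to API callers.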

Conclusion

Query inconsistency in Apache Druid is a byproduct of the system's real-time ingestion design, particularly when scaling across many nodes. With careful tuning of caching, query intervals, segment persistence settings, and node-specific routing strategies, teams can mitigate the risk and ensure data accuracy. Mature monitoring and strict ingestion/query separation are keys to enterprise-grade reliability in Druid-based architectures.

FAQs

1. Why are real-time Druid segments not queried consistently?

They may not be visible yet to brokers due to ingestion lag, memory pressure, or delayed publishing. This is common when querying data that is still being ingested.

2. Should we disable real-time nodes for querying?

Not always. But in critical systems, routing queries only to historical nodes via tiered broker rules provides higher consistency guarantees.

3. What causes segment handoff failures?

Low disk space, misconfigured deep storage, or time-based rollup gaps can prevent segments from publishing and moving to historical nodes.

4. How does Druid handle late-arriving data?

Druid can reindex segments to include late data, but this requires compaction or overwrite tasks. Late data won't be visible immediately unless real-time tasks are still active for the interval it belongs to.

5. Can we use Druid for strict real-time SLAs?

Only with caution. Druid offers near-real-time ingestion, but consistency under load has to be engineered, with ingestion controls, buffer delays, and observability in place.