Understanding QuestDB's Architecture

Column-Oriented Storage and Partitions

QuestDB stores data in a columnar format, partitioned by time (e.g., daily, monthly). This design accelerates time-series queries but can lead to fragmentation or lookup inefficiencies if improperly configured.
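As a minimal sketch (using the trades table that appears in later examples), a day-partitioned table and a quick look at its on-disk partitions might look like this; SHOW PARTITIONS is available in recent QuestDB versions:

CREATE TABLE trades (
    timestamp TIMESTAMP,
    symbol SYMBOL,
    price DOUBLE
) TIMESTAMP(timestamp) PARTITION BY DAY;

SHOW PARTITIONS FROM trades;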

Ingestion Models: ILP vs. WAL

There are two main ingestion modes:

  • ILP (InfluxDB Line Protocol): Fastest ingestion path, using a commit-on-write model
  • WAL (Write-Ahead Log): Safer ingestion with deferred commits, allowing asynchronous apply and crash recovery

Choosing the wrong model or misconfiguring WAL queues often causes ingestion lag or query inconsistencies.
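The choice is largely fixed at table creation time. As a rough sketch (table names are illustrative, syntax assumes a recent QuestDB version), a table can be declared WAL-enabled or can bypass the WAL entirely:

CREATE TABLE metrics_wal (timestamp TIMESTAMP, value DOUBLE)
    TIMESTAMP(timestamp) PARTITION BY DAY WAL;

CREATE TABLE metrics_nowal (timestamp TIMESTAMP, value DOUBLE)
    TIMESTAMP(timestamp) PARTITION BY DAY BYPASS WAL;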

Symptoms of Performance Degradation

  • Queries timing out or returning incomplete data
  • High heap usage or out-of-memory (OOM) crashes
  • Slow ingestion rates or dropped data
  • Table locks and stalled background writers

Example Errors:

Caused by: java.lang.OutOfMemoryError: Java heap space
Query timeout expired after 30s
Could not write to WAL queue, retrying...

Root Causes at Scale

1. Unbounded Queries over Wide Time Ranges

Running SELECT * or unfiltered time-based queries across multi-billion row partitions overwhelms memory-mapped buffers and can lead to JVM heap exhaustion.

2. Inefficient Partitioning Strategy

Default daily partitioning may be too granular or too coarse depending on data volume. Misaligned partitions cause unnecessary disk I/O and file descriptor exhaustion.

3. WAL Segment Saturation

WAL mode introduces write latency if commit threads are blocked. With insufficient WAL segment sizes or thread pools, ingestion stalls under pressure.

4. Overlapping Inserts or Out-of-Order Writes

QuestDB is optimized for ordered timestamps. Excessive out-of-order ingestion increases merge and index-update costs and causes fragmented writes, especially when ingesting mixed-source data over ILP.

Diagnostics and Monitoring Techniques

Inspect Query Plans

Use EXPLAIN to evaluate how the query interacts with partitions and indexes.

EXPLAIN SELECT * FROM trades WHERE timestamp >= dateadd('d', -1, now());

Look for full table scans or partition misalignment.

Enable Telemetry and Metrics

QuestDB exposes Prometheus metrics for:

  • questdb_sql_query_duration_seconds
  • questdb_writer_commits_total
  • questdb_memory_used_bytes

Correlate ingestion drops with memory or WAL commit latency spikes.
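If the endpoint is not yet enabled, metrics typically have to be switched on first; the snippet below assumes QuestDB's defaults, including the metrics port 9003:

# server.conf
metrics.enabled=true

# or as an environment variable
export QDB_METRICS_ENABLED=true

# scrape the Prometheus endpoint
curl http://localhost:9003/metrics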

Profile WAL Throughput

Track WAL writer backlog with internal logging and ensure segment sizes are tuned:

wrapper.java.additional=-Dquestdb.wal.segment.size=512000

Step-by-Step Remediation Guide

Step 1: Limit Query Scope

Always restrict queries with time filters and avoid SELECT *. Instead, specify the exact columns and narrow time ranges.

SELECT timestamp, price FROM trades WHERE timestamp > dateadd('h', -1, now());

Step 2: Re-evaluate Partitioning Granularity

For ultra-high ingest rates (e.g., tick data), consider hourly partitions:

CREATE TABLE ticks (timestamp TIMESTAMP, ...) TIMESTAMP(timestamp) PARTITION BY HOUR;

For sparse data, monthly partitioning is more efficient.
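For example, a low-volume table might be declared with monthly partitions instead (the table and column names are illustrative):

CREATE TABLE daily_summary (timestamp TIMESTAMP, value DOUBLE)
    TIMESTAMP(timestamp) PARTITION BY MONTH;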

Step 3: Increase WAL Segment Size and Threads

Update your server environment or docker container with:

export QDB_WAL_APPLY_WORKER_POOL_SIZE=8
export QDB_WAL_SEGMENT_SIZE=1048576

Step 4: Optimize JVM Memory Settings

Allocate more heap and enable GC tuning in questdb.sh or Docker env:

-Xmx8g -Xms4g -XX:+UseG1GC

Step 5: Serialize Inserts or Apply Batching

Use ILP batching to reduce out-of-order cost. Avoid concurrent inserts with identical timestamps.

line protocol:
sensors,location=LA temp=21.1 1691619117000000000
sensors,location=LA temp=21.2 1691619118000000000
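For a quick smoke test of batched, time-ordered ingestion without a client library, lines like the above can be saved to a file (batch.ilp is a hypothetical name) and pushed over QuestDB's default ILP TCP port; production ingestion should use an official ILP client with explicit flush control:

cat batch.ilp | nc localhost 9009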

Best Practices for Long-Term Stability

  • Always use time filters in production queries
  • Prefer WAL mode for critical ingestion reliability
  • Regularly clean up or archive cold partitions (see the partition-drop example after this list)
  • Set WAL apply worker pools to match CPU cores
  • Monitor ingestion lag and query latency via Prometheus
  • Pre-create schemas with partition strategy aligned to workload
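For partition cleanup, partitions can be dropped directly with SQL; a minimal sketch, assuming the trades table and a six-month retention window:

ALTER TABLE trades DROP PARTITION WHERE timestamp < dateadd('M', -6, now());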

Conclusion

QuestDB's exceptional ingestion and query speed make it ideal for real-time applications, but this performance depends on precision in schema design, query filtering, and memory management. Most ingestion and query timeout issues arise from unbounded queries, misaligned partitions, or under-tuned WAL configurations. With the right monitoring and proactive configuration, teams can maintain millisecond-level performance even with billions of rows.

FAQs

1. Can I change the partitioning on an existing table?

No, partitioning is defined at table creation. To change it, you must export data, drop and recreate the table, then re-ingest.
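For datasets that fit this approach, the copy can also be done entirely in SQL instead of a full export; a minimal sketch, assuming the new table should be hourly-partitioned and the schema shown is illustrative:

CREATE TABLE trades_hourly (timestamp TIMESTAMP, symbol SYMBOL, price DOUBLE)
    TIMESTAMP(timestamp) PARTITION BY HOUR WAL;

INSERT INTO trades_hourly SELECT * FROM trades;

DROP TABLE trades;
RENAME TABLE trades_hourly TO trades;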

2. How do I detect ingestion lag?

Use Prometheus metrics or logs to track WAL writer delays, ingestion errors, or compare timestamp lag between latest insert and system time.
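A simple SQL check is to compare the newest ingested timestamp with the server clock; on WAL tables, wal_tables() also reports per-table writer and sequencer transaction positions (the trades table is illustrative):

SELECT datediff('s', max(timestamp), now()) AS lag_seconds FROM trades;

SELECT * FROM wal_tables();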

3. What is the default WAL segment size?

Default is 512KB. This may be too small for high-velocity streams. Increase to 1MB–4MB for better throughput on multicore systems.

4. Are out-of-order writes always problematic?

Not always, but high-volume out-of-order data forces costly merges and partition rewrites, which slows ingestion. Sort data before writing or increase the maximum lag threshold.
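One related per-table knob is the uncommitted-row buffer, which lets QuestDB absorb and sort more out-of-order rows before committing; the lag threshold itself is also configurable, but its parameter name varies across QuestDB versions, so only the row-count setting is sketched here:

ALTER TABLE trades SET PARAM maxUncommittedRows = 500000;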

5. Is ILP faster than WAL mode?

Yes, ILP offers lower latency at the cost of durability. Use it when ingest speed is critical and failure tolerance is acceptable.