Architectural Background

Distributed System Components

SingleStore clusters comprise aggregator nodes (coordinators) and leaf nodes (data storage/execution engines). Aggregators parse queries, distribute work to the leaf nodes, and merge the partial results. Poor shard-to-leaf distribution or hot partitions result in skewed workloads and reduced parallelism.
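
To confirm how many coordinators and leaves the cluster sees, and that every leaf is online, the standard topology commands can be run from an aggregator:

-- Run from the master aggregator: list coordinator and data nodes with their state
SHOW AGGREGATORS;
SHOW LEAVES;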

HTAP Workload Sensitivity

SingleStore supports hybrid (OLTP + OLAP) workloads, but heavy concurrent DML and analytical SELECT queries can interfere with each other, especially when memory grants and concurrency controls are not tuned properly.
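
A quick first check when transactional and analytical traffic collide is to look at what is actually running concurrently; SHOW PROCESSLIST, available through SingleStore's MySQL-compatible interface, shows the current mix of long-running SELECTs and short DML:

-- Current statements on this aggregator: look for long analytical queries
-- competing with high-frequency DML
SHOW PROCESSLIST;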

Common Symptoms in Production

  • High latency on simple SELECT or INSERT operations
  • One or more leaf nodes showing consistent CPU/memory saturation
  • Replication lag or pipeline backpressure
  • Degraded performance after data rebalancing or failover

Root Cause Diagnosis

1. Query Plan Analysis

Use EXPLAIN and PROFILE to detect bottlenecks in query execution. Look for signs of:

  • Broadcast joins on large datasets
  • Imbalanced query fan-out to leaf nodes
  • Subqueries without proper indexing

PROFILE SELECT * FROM orders WHERE customer_id = 12345;
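
PROFILE executes the statement and records per-operator timings, which SHOW PROFILE then displays in the same session; EXPLAIN inspects the chosen plan without executing it:

-- Inspect the plan without running the query
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- After the PROFILE statement above, display time spent per operator
SHOW PROFILE;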

2. Shard Skew and Data Distribution

Shard skew happens when certain partitions accumulate more data or traffic. Use system views to inspect distribution:

SELECT database_name, table_name, shard_id, row_count 
FROM information_schema.shard_rows; 
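
Building on that view (the shard_rows name follows this article; adjust it to the distribution views your SingleStore version exposes), a rough skew check compares each table's largest shard to its average shard:

-- Tables whose biggest shard holds far more rows than the average shard
SELECT database_name, table_name,
       MAX(row_count) AS max_rows,
       AVG(row_count) AS avg_rows,
       MAX(row_count) / NULLIF(AVG(row_count), 0) AS skew_ratio
FROM information_schema.shard_rows
GROUP BY database_name, table_name
ORDER BY skew_ratio DESC;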

3. Pipeline and Ingestion Bottlenecks

Ingest pipelines build up backpressure when the target tables, their indexes, or downstream queries cannot keep up with the incoming data. Review pipeline lag:

SHOW PIPELINES; 
SHOW PIPELINE STATUS;
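
For per-batch detail, the pipeline views in information_schema can be queried as well; the view and column names below are assumptions to verify against your version's catalog:

-- Recent batch history for the cluster's pipelines (names may vary by version)
SELECT database_name, pipeline_name, batch_id, batch_state, batch_time
FROM information_schema.pipelines_batches_summary
ORDER BY batch_id DESC
LIMIT 20;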

Troubleshooting Guide

  1. Profile Problem Queries: Run PROFILE and analyze time spent per operator (scan, join, network exchange).
  2. Check Shard Balance: Use shard_rows and shard_cpu views to detect hot partitions.
  3. Review Indexing: Ensure that filters and joins leverage covering indexes; otherwise table scans will dominate.
  4. Audit Pipelines: Identify lag sources and reduce ingest-to-query coupling.
  5. Monitor Resource Contention: Track memory pool usage, thread pool wait times, and per-node CPU and memory pressure (see the snippet after this list).
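
For step 5, a quick node-level memory snapshot is available through SHOW STATUS EXTENDED; the exact counter names vary by version, so filter broadly:

-- Allocator- and pool-level memory counters on this node
SHOW STATUS EXTENDED LIKE '%memory%';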

Architectural Pitfalls to Avoid

Shard Key Selection

Choosing a shard key that is low-cardinality, non-uniform, or unrelated to your access patterns can cripple data distribution. Always align shard keys with query predicates and join keys.
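
As a minimal sketch (the table mirrors the orders example used earlier), sharding on the column that dominates filters and joins keeps related rows on one partition and spreads load evenly:

-- Shard on the high-cardinality column used in predicates and joins
CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    order_ts DATETIME,
    amount DECIMAL(12, 2),
    SHARD KEY (customer_id)
);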

Overuse of Pipelines Without Backpressure Control

Pipelines that ingest large volumes without consumer-side throttling can overrun memory and destabilize nodes. Implement batching and monitoring.
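
Below is a sketch of an ingest pipeline with explicit batch pacing; the Kafka endpoint and table are placeholders, and the BATCH_INTERVAL and MAX_PARTITIONS_PER_BATCH options should be checked against the CREATE PIPELINE reference for your version:

-- Placeholder broker/topic; pace batches rather than ingesting as fast as possible
CREATE PIPELINE orders_ingest AS
LOAD DATA KAFKA 'kafka-broker:9092/orders'
BATCH_INTERVAL 2500                -- milliseconds between batch launches (assumed option)
MAX_PARTITIONS_PER_BATCH 8         -- cap per-batch fan-out across leaves (assumed option)
INTO TABLE orders
FIELDS TERMINATED BY ',';

START PIPELINE orders_ingest;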

Aggregator Bottlenecks

Co-locating aggregators with leaf nodes on resource-constrained hosts, or under-provisioning CPU/RAM for aggregators, can delay query parsing, fan-out, and result merging.

Best Practices

  • Shard on high-cardinality identifiers (declare them in the table's SHARD KEY)
  • Continuously monitor mv_memory_usage and mv_query_history views
  • Distribute load using round-robin ingestion or load balancers
  • Avoid cross-database joins unless absolutely necessary
  • Run OPTIMIZE TABLE periodically to reclaim storage and rebalance stats
  • Use workload management rules or resource pools to isolate ingest vs. analytics queries (see the sketch below)
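
For the last point, resource pools are one way to keep ingest and analytics sessions from starving each other; the limits below are illustrative and the option names should be verified against your version:

-- Illustrative pools: cap analytics memory and concurrency so ingest stays responsive
CREATE RESOURCE POOL analytics_pool WITH
    MEMORY_PERCENTAGE = 40,
    MAX_CONCURRENCY = 20,
    QUERY_TIMEOUT = 600;

CREATE RESOURCE POOL ingest_pool WITH
    MEMORY_PERCENTAGE = 30,
    QUERY_TIMEOUT = 120;

-- Sessions opt in to a pool
SET resource_pool = analytics_pool;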

Conclusion

While SingleStore delivers impressive performance under the right conditions, achieving consistent scalability in enterprise workloads requires deep awareness of sharding strategies, query patterns, and resource flows. Query profiling, shard balancing, and proper ingestion planning form the cornerstone of stable deployments. Treating SingleStore as a distributed system—with its own inter-node dependencies and planner nuances—is key to sustainable performance.

FAQs

1. Why does a single node always show higher CPU usage?

It's likely a shard skew issue where one partition handles more data or traffic. Reevaluate the shard key and check distribution metrics.

2. How do I prevent pipeline lag in high-throughput ingestion?

Throttle input, batch records, and use well-indexed target tables. Monitor pipeline_lag and retry_count for signs of bottlenecks.

3. Can I improve join performance without changing schema?

Yes, by using hash joins with session hints and ensuring both join sides have indexed keys. Also, filter early to reduce join input size.

4. Why do queries slow down after table rebalancing?

Post-rebalance, data locality may be affected or new leaf nodes may need to warm caches. Check network and memory metrics across nodes.

5. What's the best way to monitor shard distribution?

Use system views like shard_rows, shard_cpu, and leaf_status to visualize distribution and workload skew across shards.