ClickHouse Replication and MergeTree Internals
ReplicatedMergeTree Overview
Most distributed ClickHouse deployments use ReplicatedMergeTree or its derivatives. These engines replicate parts via ZooKeeper and perform background merges to consolidate data for efficient querying.
Important components include:
- ZooKeeper: Tracks part metadata and replication logs
- MergeTree: Handles storage, merges, and TTL rules
- Mutations: Background data transformations
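For context, a minimal ReplicatedMergeTree definition looks roughly like the sketch below; the database, table, and schema are placeholders, and the {shard} and {replica} macros are assumed to be defined in the server configuration.

CREATE TABLE example_db.events
(
    event_date Date,
    user_id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')  -- ZooKeeper path and replica name
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);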
Common Failure Points
- ZooKeeper connection instability
- Excessive unmerged parts
- Replica lag due to long merges or failed mutations
Symptoms of Replication or Merge Failures
1. Replica Shows "Not Synchronized" Status
ClickHouse logs or the system.replicas table may show:
Replica is not active
Log pulling queue size: 1200
Last queue update: 20 minutes ago
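A quick SQL check against system.replicas surfaces the same information; the filter below simply highlights replicas with any backlog or delay and can be tightened to your own thresholds.

SELECT database, table, is_readonly, is_session_expired, queue_size, absolute_delay
FROM system.replicas
WHERE queue_size > 0 OR absolute_delay > 0;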
2. Queries Return Inconsistent Results
Queries on distributed tables may return partial data if some shards or replicas are out of sync; the SELECT may silently omit rows that only exist on lagging replicas.
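One way to spot divergence is to compare per-replica row counts with the clusterAllReplicas table function; the cluster and table names below are placeholders for your own.

-- differing counts between hosts indicate out-of-sync replicas
SELECT hostName() AS replica, count() AS rows
FROM clusterAllReplicas('my_cluster', example_db.events)
GROUP BY replica;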
3. Merges Not Executed
Large part counts visible in system.parts indicate a merge backlog:
SELECT table, count(*)
FROM system.parts
WHERE active = 1
GROUP BY table
HAVING count(*) > 1000
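If parts keep accumulating, it is also worth checking whether merges are running at all; system.merges lists the merges currently in progress.

SELECT database, table, elapsed, progress, num_parts, result_part_name
FROM system.merges;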
Diagnostics and Root Cause Analysis
1. Check ZooKeeper Health
If replication log updates are stalled:
echo ruok | nc localhost 2181   # Should return imok
zkCli.sh -server <zk_host> ls /clickhouse/tables
Restarting ClickHouse while ZooKeeper is degraded can corrupt replica metadata.
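ZooKeeper reachability can also be verified from inside ClickHouse via the system.zookeeper table; the path shown is the common default root and may differ in your installation.

SELECT name, mzxid
FROM system.zookeeper
WHERE path = '/clickhouse';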
2. Investigate Replica Lag
SELECT * FROM system.replicas WHERE is_session_expired OR future_parts > 0;
Future parts or queue size > 1000 usually indicate merge or mutation backlogs.
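To see which specific queue entries are stuck and why, system.replication_queue records each pending action along with its retry count and last error; the LIMIT is illustrative.

SELECT database, table, type, create_time, num_tries, last_exception, postpone_reason
FROM system.replication_queue
ORDER BY create_time
LIMIT 20;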
3. Examine Part Merge Failures
Check system logs:
grep MergeTree /var/log/clickhouse-server/clickhouse-server.log
Look for entries like:
Code: 253, e.displayText() = DB::Exception: Cannot merge parts...
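If the part_log table is enabled in the server configuration (it may not be on older installations), failed merge and mutation operations are also recorded there with an error code and exception text.

SELECT event_time, event_type, table, part_name, error, exception
FROM system.part_log
WHERE error != 0
ORDER BY event_time DESC
LIMIT 20;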
Resolution Strategies
1. Restart Replica Safely
Pause background activity with SYSTEM STOP MERGES and SYSTEM STOP FETCHES on the affected table before restarting, to avoid race conditions.
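A rough sequence might look like the following; the table name is a placeholder.

-- pause background work on the affected table
SYSTEM STOP MERGES example_db.events;
SYSTEM STOP FETCHES example_db.events;
-- restart the clickhouse-server process, then re-enable background work
SYSTEM START MERGES example_db.events;
SYSTEM START FETCHES example_db.events;
-- re-read replica metadata and wait for the replication queue to drain
SYSTEM RESTART REPLICA example_db.events;
SYSTEM SYNC REPLICA example_db.events;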
2. Clear Invalid Queues
If a part is corrupted or stale:
ALTER TABLE table_name DROP PART 'part_id';
Or use ZooKeeper CLI to manually delete broken nodes.
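On recent ClickHouse versions, replica metadata can often be repaired without manual ZooKeeper edits; the statements below are a sketch with placeholder names, and their availability depends on your server version.

-- remove a dead replica's metadata from ZooKeeper (run from a surviving replica)
SYSTEM DROP REPLICA 'replica_2' FROM TABLE example_db.events;
-- recreate ZooKeeper metadata for a replica whose znodes were lost (run on that replica)
SYSTEM RESTORE REPLICA example_db.events;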
3. Tune Merge Parameters
Adjust MergeTree settings, for example allowing larger merges (max_bytes_to_merge_at_max_space_in_pool) and tuning the part-count threshold at which inserts are rejected (parts_to_throw_insert):
max_bytes_to_merge_at_max_space_in_pool = 10000000000
parts_to_throw_insert = 200
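These are table-level MergeTree settings; one way to apply them to an existing table is ALTER TABLE ... MODIFY SETTING. The values mirror the ones above and are not recommendations; tune them against your own workload.

ALTER TABLE example_db.events
MODIFY SETTING
    max_bytes_to_merge_at_max_space_in_pool = 10000000000,
    parts_to_throw_insert = 200;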
Best Practices to Avoid Future Issues
1. Set Replica Priorities
Assign per-replica priorities in the cluster (remote_servers) configuration and pair them with a deterministic load_balancing policy such as in_order, so that queries consistently prefer the same healthy, up-to-date replicas.
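A minimal sketch of selecting the deterministic policy at query time; the distributed table name is a placeholder, and the per-replica priority values themselves live in the remote_servers configuration.

SELECT count()
FROM example_db.events_distributed
SETTINGS load_balancing = 'in_order';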
2. Monitor with system.replicas and system.parts
Alert if absolute_delay > 300 or queue_size > 1000 for any replica.
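A query suitable for a periodic alerting job might look like this; the thresholds match those above and should be tuned to your workload.

SELECT database, table, replica_name, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 300 OR queue_size > 1000;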
3. Manage Part Explosion
Avoid creating too many small parts via frequent inserts. Use buffer tables or batch loads.
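As a sketch of the buffer-table approach, a Buffer engine table can absorb frequent small inserts and flush them to the target MergeTree table in larger batches; the names, schema, and flush thresholds below are placeholders.

-- buffer flushes to example_db.events when time/row/byte thresholds are reached
CREATE TABLE example_db.events_buffer AS example_db.events
ENGINE = Buffer(example_db, events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);
-- applications write to the buffer table instead of the MergeTree table directly
INSERT INTO example_db.events_buffer VALUES ('2024-01-01', 42, 'example');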
Conclusion
ClickHouse's high-speed architecture comes with replication and merge tradeoffs that require close observability. By proactively monitoring ZooKeeper health, part counts, replica lag, and system queues, teams can avoid data inconsistency and query degradation. Applying merge tuning and safe restart protocols ensures that ClickHouse remains resilient even in high-ingest, distributed workloads.
FAQs
1. Why is my ClickHouse replica stuck in 'log pulling'?
Likely due to ZooKeeper disconnection, part merge failures, or excessive backlog. Check the logs and the queue_size column in system.replicas.
2. Can I run ClickHouse without ZooKeeper?
Only for non-replicated engines. Distributed fault-tolerant setups require ZooKeeper or Keeper for coordination.
3. How do I prevent part count explosion?
Use batch inserts instead of frequent single-row inserts. Target fewer than 300 active parts per shard where possible.
4. How can I force merges on a table?
Use OPTIMIZE TABLE table FINAL. Note that this is synchronous and can be expensive for large tables.
5. What's the best way to monitor ClickHouse replica health?
Regularly query system.replicas, alert on high queue size or delay, and monitor merge activity in system.merges.