ClickHouse Replication and MergeTree Internals

ReplicatedMergeTree Overview

Most distributed ClickHouse deployments use ReplicatedMergeTree or its derivatives. These engines coordinate replication through ZooKeeper (or ClickHouse Keeper), fetch data parts between replicas, and perform background merges to consolidate data for efficient querying.

Important components include:

  • ZooKeeper: Tracks part metadata and replication logs
  • MergeTree: Handles storage, merges, and TTL rules
  • Mutations: Background data transformations

Common Failure Points

  • ZooKeeper connection instability
  • Excessive unmerged parts
  • Replica lag due to long merges or failed mutations

Symptoms of Replication or Merge Failures

1. Replica Shows "Not Synchronized" Status

ClickHouse logs or the system.replicas table may show:

Replica is not active
Log pulling queue size: 1200
Last queue update: 20 minutes ago
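
These fields come from system.replicas. A minimal query to surface them (adjust the filter to your environment) might look like:

SELECT database, table, is_readonly, queue_size, absolute_delay, last_queue_update
FROM system.replicas
WHERE queue_size > 0 OR is_readonly;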

2. Queries Return Inconsistent Results

Queries on Distributed tables may return partial or stale results if some shards or replicas are out of sync: a lagging replica simply does not yet have the parts that other replicas do, and the SELECT gives no indication of this.
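
If staleness matters more than availability, query-level settings can make it explicit. A hedged sketch, assuming a Distributed table named events_distributed (hypothetical name); max_replica_delay_for_distributed_queries and fallback_to_stale_replicas_for_distributed_queries are standard query settings:

SELECT count()
FROM events_distributed
SETTINGS
    max_replica_delay_for_distributed_queries = 60,          -- skip replicas lagging more than 60 s
    fallback_to_stale_replicas_for_distributed_queries = 0;  -- fail instead of reading stale data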

3. Merges Not Executed

A large active part count in system.parts indicates a merge backlog:

SELECT table, count(*) FROM system.parts WHERE active = 1 GROUP BY table HAVING count(*) > 1000

Diagnostics and Root Cause Analysis

1. Check ZooKeeper Health

If replication log updates are stalled:

echo ruok | nc localhost 2181  # Should return imok
zkCli.sh -server <zk_host> ls /clickhouse/tables

Avoid restarting ClickHouse while ZooKeeper is degraded: replicas may come back in read-only mode or with replication metadata that needs manual repair.
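
You can also inspect the coordination paths from inside ClickHouse via the system.zookeeper table, which proxies reads from ZooKeeper/Keeper. The path below assumes the default /clickhouse/tables prefix:

SELECT name, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/tables';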

2. Investigate Replica Lag

SELECT * FROM system.replicas WHERE is_session_expired OR future_parts > 0;

A non-zero future_parts value or a queue size above roughly 1000 usually indicates a merge or mutation backlog.
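
To see what the queue is actually stuck on, a breakdown of system.replication_queue by entry type, with a sample error, is often more actionable than the aggregate counters:

SELECT type, count() AS entries, max(num_tries) AS max_tries, any(last_exception) AS sample_error
FROM system.replication_queue
GROUP BY type
ORDER BY entries DESC;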

3. Examine Part Merge Failures

Check system logs:

grep MergeTree /var/log/clickhouse-server/clickhouse-server.log

Look for entries like:

Code: 253, e.displayText() = DB::Exception: Cannot merge parts...
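
Long-running or stuck merges are also visible in system.merges while they execute. A quick sketch to list the slowest ones:

SELECT database, table, elapsed, round(progress, 2) AS progress_ratio, num_parts, result_part_name
FROM system.merges
ORDER BY elapsed DESC
LIMIT 10;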

Resolution Strategies

1. Restart Replica Safely

Pause background activity on the affected replica with SYSTEM STOP MERGES and SYSTEM STOP FETCHES before restarting to avoid race conditions.
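
A minimal sequence, assuming a replicated table db.events (hypothetical name); SYSTEM STOP/START MERGES and FETCHES are standard commands:

SYSTEM STOP MERGES db.events;
SYSTEM STOP FETCHES db.events;
-- restart the clickhouse-server process here
SYSTEM START FETCHES db.events;
SYSTEM START MERGES db.events;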

2. Clear Invalid Queues

If a part is corrupted or stale:

ALTER TABLE table_name DROP PART 'part_id';

Or use ZooKeeper CLI to manually delete broken nodes.
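
Before dropping anything, it can help to confirm what ClickHouse itself has set aside. system.detached_parts lists parts that were detached, and its reason column typically explains why (for example broken or unexpected):

SELECT database, table, name, reason
FROM system.detached_parts
WHERE reason != '';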

3. Tune Merge Parameters

Adjust MergeTree table settings to give merges more headroom and to control insert backpressure (the values below are illustrative; tune them for your hardware and workload):

max_bytes_to_merge_at_max_space_in_pool = 10000000000
parts_to_throw_insert = 200
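
These are table-level MergeTree settings, so they can be applied per table with ALTER TABLE ... MODIFY SETTING. A sketch using the same illustrative values and a hypothetical table db.events:

ALTER TABLE db.events
    MODIFY SETTING
        max_bytes_to_merge_at_max_space_in_pool = 10000000000,
        parts_to_throw_insert = 200;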

Best Practices to Avoid Future Issues

1. Set Replica Priorities

Prefer up-to-date replicas consistently by assigning each replica a priority in the cluster definition (the priority element under replica in remote_servers) and selecting a deterministic load_balancing policy:

SET load_balancing = 'in_order';

2. Monitor with system.replicas and system.replication_queue

Alert if absolute_delay > 300 or queue_size > 1000 for any replica.
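
A simple monitoring query built on those thresholds (adjust them to your SLOs):

SELECT database, table, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 300 OR queue_size > 1000;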

3. Manage Part Explosion

Avoid creating too many small parts via frequent inserts. Use buffer tables or batch loads.
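
One common pattern is a Buffer table in front of the target MergeTree table, so many small client inserts are flushed as larger parts. A sketch assuming a hypothetical target table db.events; the Buffer engine parameters are num_layers followed by min/max time, rows, and bytes flush thresholds:

CREATE TABLE db.events_buffer AS db.events
ENGINE = Buffer(db, events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);
-- clients write to db.events_buffer; data is flushed to db.events in larger batches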

Conclusion

ClickHouse's high-speed architecture comes with replication and merge tradeoffs that require close observability. By proactively monitoring ZooKeeper health, part counts, replica lag, and system queues, teams can avoid data inconsistency and query degradation. Applying merge tuning and safe restart protocols ensures that ClickHouse remains resilient even in high-ingest, distributed workloads.

FAQs

1. Why is my ClickHouse replica stuck in 'log pulling'?

Likely due to ZooKeeper disconnection, part merge failures, or excessive backlog. Check logs and system.replicas queue size.

2. Can I run ClickHouse without ZooKeeper?

Only for non-replicated engines. Distributed fault-tolerant setups require ZooKeeper or Keeper for coordination.

3. How do I prevent part count explosion?

Use batch inserts instead of frequent single-row inserts, and keep active part counts well below the parts_to_throw_insert threshold (300 per partition by default).

4. How can I force merges on a table?

Use OPTIMIZE TABLE table FINAL. Note this is synchronous and can be expensive for large tables.
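
If only part of the table is affected, scoping the merge to a single partition keeps the cost bounded. A sketch assuming a hypothetical table db.events partitioned by toYYYYMM(event_date):

OPTIMIZE TABLE db.events PARTITION 202406 FINAL;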

5. What's the best way to monitor ClickHouse replica health?

Regularly query system.replicas, alert on high queue size or delay, and monitor merge activity in system.merges.