Introduction

Kafka provides high availability through partition replication and automatic leader election, but broker failures, improper replication configurations, and inefficient cluster settings can lead to prolonged under-replicated partitions, slow failovers, and potential data loss. Common pitfalls include setting an insufficient replication factor, misconfigured minimum in-sync replicas, improperly tuned broker timeouts, frequent broker restarts causing instability, and inefficient partition reassignment. These issues become particularly problematic in large-scale event-driven architectures where Kafka uptime and message durability are critical. This article explores Kafka under-replicated partitions, leader election failures, and best practices for maintaining cluster health.

Common Causes of Under-Replicated Partitions and Leader Election Failures

1. Insufficient Replication Factor Reducing Fault Tolerance

Setting a low replication factor reduces resilience to broker failures.

Problematic Scenario

bin/kafka-topics.sh --create --topic orders --partitions 6 --replication-factor 1 --bootstrap-server localhost:9092

A replication factor of `1` means no failover if the leader broker crashes.

Solution: Increase the Replication Factor via Partition Reassignment

`kafka-topics.sh --alter` cannot change a topic's replication factor. Instead, supply a reassignment plan that adds replicas for each partition and execute it:

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase-rf.json --execute

A replication factor of `3` provides redundancy: the topic stays available through the loss of any single broker.
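As a sketch, a reassignment plan that raises the `orders` topic to replication factor `3` might look like the following. The broker IDs (`1`, `2`, `3`) and the three partitions shown are assumptions; a real plan must list every partition of the topic:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "replicas": [2, 3, 1]},
    {"topic": "orders", "partition": 2, "replicas": [3, 1, 2]}
  ]
}
```

Rotating the first entry (the preferred leader) across brokers spreads leadership evenly once the reassignment completes.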

2. Misconfigured Minimum In-Sync Replicas (ISR) Causing Data Loss

Setting `min.insync.replicas` too low allows Kafka to acknowledge writes, even with `acks=all`, while too few replicas actually hold the data.

Problematic Scenario

min.insync.replicas=1

Allowing writes with only one in-sync replica increases the risk of data loss if a broker fails.

Solution: Increase `min.insync.replicas` to Ensure Replication

min.insync.replicas=2

Requiring at least two replicas in sync prevents acknowledged writes from being lost.
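`min.insync.replicas` is only enforced for producers that request full acknowledgment, so the broker-side setting should be paired with the corresponding producer-side setting. A minimal producer configuration sketch (values illustrative):

```properties
# producer.properties
# Wait for every in-sync replica to acknowledge before a send succeeds.
acks=all
# With min.insync.replicas=2 on the topic, sends fail with
# NotEnoughReplicasException when fewer than 2 replicas are in sync,
# instead of silently accepting an under-replicated write.
```

Without `acks=all`, the durability guarantee of `min.insync.replicas=2` never takes effect.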

3. Frequent Broker Failures Causing Unstable Leadership

Unstable brokers frequently go down, leading to under-replicated partitions.

Problematic Scenario

log.retention.ms=60000

An aggressively low retention time (here 60 seconds) keeps brokers constantly deleting segments, adding I/O churn, and gives lagging replicas almost no window to catch up before data is removed.

Solution: Optimize Log Retention and Monitoring

log.retention.ms=604800000

A longer retention window (here 7 days, in milliseconds) eliminates constant segment deletion and gives lagging replicas time to rejoin the ISR.
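Under-replication is also easy to check directly: `kafka-topics.sh` can list only the partitions whose ISR has shrunk below the replica count (assuming a broker reachable at `localhost:9092`):

```shell
bin/kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092
```

An empty result means every partition is fully replicated; any output here is a reason to investigate the listed brokers.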

4. Inefficient Leader Election Slowing Failovers

Kafka can take a long time to elect a new leader if ZooKeeper session settings are not tuned: the controller only learns that a broker is gone after its ZooKeeper session expires.

Problematic Scenario

zookeeper.session.timeout.ms=60000

With a 60-second session timeout, a crashed broker's partitions can sit leaderless for up to a minute before failover even begins.

Solution: Tune the ZooKeeper Session Timeout for Faster Failover

zookeeper.session.timeout.ms=15000

A shorter timeout speeds up failure detection and leader election. Avoid setting it too low, however, as GC pauses or brief network blips can then trigger spurious session expirations and needless elections.
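Failover speed is only half the picture; what happens during the election also matters. A broker-side sketch pairing the timeout above with the related safety setting (`server.properties` fragment; values illustrative):

```properties
# server.properties
# Expire dead brokers' sessions quickly so elections start sooner.
zookeeper.session.timeout.ms=15000
# Never elect an out-of-sync replica as leader: prefer temporary
# unavailability over losing acknowledged writes.
unclean.leader.election.enable=false
```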

5. Partition Reassignment Bottlenecks Slowing Recovery

Improper partition reassignment strategies can cause excessive broker load.

Problematic Scenario

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file partitions.json --execute

Reassigning all partitions at once, with no throttle, lets reassignment traffic compete with production traffic and can overload the brokers.

Solution: Use Throttled Partition Reassignment

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file partitions.json --throttle 50000000 --execute

Throttling reassignment (the value is in bytes per second, so `50000000` is roughly 50 MB/s) caps replication traffic and prevents excessive load on the cluster.
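The throttle stays in place after the data has moved; running the tool again with `--verify` both confirms the reassignment completed and removes the throttle:

```shell
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file partitions.json --verify
```

Forgetting this step leaves replication permanently rate-limited, which slows future failover recovery.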

Best Practices for Maintaining Kafka Cluster Stability

1. Increase Replication Factor for Redundancy

Ensure fault tolerance by configuring proper replication.

Example (replication-factor changes require a partition reassignment; `kafka-topics.sh --alter` cannot change them):

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase-rf.json --execute

2. Optimize Minimum In-Sync Replicas

Prevent data loss by ensuring multiple replicas are required for writes.

Example:

min.insync.replicas=2

3. Monitor and Stabilize Broker Uptime

Prevent frequent failures by optimizing log retention and monitoring.

Example:

log.retention.ms=604800000

4. Optimize Leader Election Timing

Ensure fast failovers with a properly tuned ZooKeeper session timeout.

Example:

zookeeper.session.timeout.ms=15000

5. Use Throttled Partition Reassignment

Prevent excessive broker load when redistributing partitions.

Example:

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file partitions.json --throttle 50000000 --execute

Conclusion

Kafka under-replicated partitions and leader election failures often result from insufficient replication, improper in-sync replica settings, frequent broker failures, poorly tuned ZooKeeper timeouts, and unthrottled partition reassignment. By increasing replication factors, raising minimum in-sync replicas, stabilizing brokers, tuning leader election timeouts, and using throttled reassignment, developers can significantly improve Kafka cluster resilience. Regular monitoring with `kafka-topics.sh --describe --under-replicated-partitions`, `kafka-consumer-groups.sh`, and broker JMX metrics such as `UnderReplicatedPartitions` helps detect and resolve cluster stability issues before they impact production.