Understanding Cassandra Architecture
Masterless Ring and Replication
Cassandra uses a masterless, peer-to-peer architecture. Each node is equal, with data distributed across the ring via consistent hashing and replicated based on a configurable replication factor.
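As a concrete illustration, the replication factor is declared per keyspace. The sketch below (hypothetical keyspace and datacenter names) creates a keyspace replicated three ways within one datacenter:

```
# Minimal sketch: replication is configured per keyspace.
# "orders_ks" and "dc1" are hypothetical names.
cqlsh -e "CREATE KEYSPACE IF NOT EXISTS orders_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
```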
Tunable Consistency and Hinted Handoff
Read and write operations can specify consistency levels (e.g., ONE, QUORUM, ALL), allowing trade-offs between availability and consistency. Hinted handoff temporarily stores writes for unavailable nodes.
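For example, the consistency level can be chosen per request. The sketch below uses cqlsh's CONSISTENCY command against a hypothetical table; client drivers expose the same setting per statement:

```
# Minimal sketch: consistency is set per request. CONSISTENCY is a cqlsh command;
# keyspace, table, and key are hypothetical.
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM orders_ks.orders WHERE order_id = 42;"
```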
Common Cassandra Issues
1. Read Latency and Inconsistency
Caused by out-of-sync replicas, read repairs lagging, or excessive tombstones. QUORUM reads may return outdated data if consistency is not enforced across nodes.
2. Tombstone Accumulation
Deletes in Cassandra create tombstones, which can accumulate and degrade performance. Large partitions with many tombstones trigger TombstoneOverwhelmingException during reads.
3. Disk Space Pressure
Improper compaction, uncontrolled tombstones, and large SSTables consume disk rapidly. Nodes under pressure may trigger write failures or throttle background tasks.
4. Repair and Rebuild Failures
Manual or automatic repairs may stall or fail due to network issues, inconsistent token ranges, or schema mismatches across nodes. Rebuilds may hang on large clusters without proper throttling.
5. Configuration Drift Across Nodes
Inconsistent cassandra.yaml files or misaligned Java versions across the cluster can cause gossip issues, token mismanagement, and service instability.
Diagnostics and Debugging Techniques
Use nodetool for Node Health and Status
Run nodetool status to check up/down state, load, and tokens. Use nodetool info to review disk usage, compactions, and heap.
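A quick health pass might look like the following (run from any node; sketch only):

```
nodetool status   # UN = Up/Normal; shows load, tokens, and ownership per node
nodetool info     # uptime, heap usage, data load, and gossip state for the local node
```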
Monitor Read/Write Latencies
Access JMX metrics or use tools like Prometheus + Grafana for tracking ReadLatency, WriteLatency, and PendingTasks.
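When a full metrics pipeline is not yet in place, nodetool offers ad-hoc latency views; the keyspace and table names below are hypothetical:

```
nodetool proxyhistograms                    # coordinator-level read/write latency percentiles
nodetool tablehistograms orders_ks orders   # per-table latency, partition size, SSTables per read
nodetool tpstats                            # thread pools with pending or blocked tasks
```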
Query for Tombstone Density
Use cqlsh with TRACING ON or SELECT * FROM ... with ALLOW FILTERING to detect high tombstone partitions.
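A minimal sketch, with hypothetical keyspace, table, and key:

```
nodetool tablestats orders_ks.orders | grep -i tombstone   # avg/max tombstones per slice
cqlsh -e "TRACING ON; SELECT * FROM orders_ks.orders WHERE order_id = 42;"
# The trace output reports how many tombstone cells each replica had to skip.
```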
Verify Repair Integrity
Run nodetool repair with --full and review logs for exceptions. Compare schema versions using nodetool describecluster.
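For instance, a full repair of a single (hypothetical) keyspace followed by a schema-agreement check:

```
# Watch system.log for streaming or validation errors during the repair.
nodetool repair --full orders_ks
nodetool describecluster   # every node should report the same schema version
```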
Audit Config Files with Hashing
Use scripts to compute and compare checksums of cassandra.yaml, jvm.options, and startup scripts across nodes for configuration drift.
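A minimal sketch of such a script, assuming SSH access; host names and file paths are hypothetical:

```
#!/usr/bin/env bash
# Flag configuration drift by hashing key files on every node.
NODES="cass-node1 cass-node2 cass-node3"
FILES="/etc/cassandra/cassandra.yaml /etc/cassandra/jvm.options"

for host in $NODES; do
  for f in $FILES; do
    sum=$(ssh "$host" "sha256sum $f 2>/dev/null" | awk '{print $1}')
    printf '%-14s %-40s %s\n' "$host" "$f" "$sum"
  done
done
# Any file whose hash differs between nodes indicates drift.
```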
Step-by-Step Resolution Guide
1. Resolve Read Latency and Inconsistencies
Use nodetool repair regularly. Tune read consistency level to QUORUM or LOCAL_QUORUM. Avoid using ALLOW FILTERING in production queries.
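If repairs are scheduled by hand rather than through cassandra-reaper, a simple cron entry can cover the routine run; the schedule, keyspace, and log path below are illustrative only:

```
# Weekly primary-range repair from cron (assumes nodetool is on PATH for this user).
# m h dom mon dow   command
0 2 * * 0  nodetool repair -pr orders_ks >> /var/log/cassandra/repair-cron.log 2>&1
```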
2. Manage Tombstones Efficiently
Adjust gc_grace_seconds only after evaluating anti-entropy repair timelines. Limit use of deletes on wide partitions. Use TTLs strategically.
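A sketch of both levers against a hypothetical table; only shorten gc_grace_seconds if repairs reliably complete inside the new window:

```
cqlsh -e "
ALTER TABLE orders_ks.orders WITH gc_grace_seconds = 259200;
INSERT INTO orders_ks.orders (order_id, status) VALUES (43, 'open') USING TTL 604800;
"
# 259200 s = 3 days grace before tombstones become purgeable;
# 604800 s = 7-day TTL, expiring data without an explicit delete.
```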
3. Free Disk Space via Compaction
Trigger nodetool compact on overloaded tables. Monitor compactionstats. Review compaction_throughput_mb_per_sec to ensure balanced background I/O.
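For example (keyspace and table names hypothetical):

```
nodetool compactionstats                 # pending compactions and active tasks
nodetool getcompactionthroughput         # current throttle in MB/s
nodetool setcompactionthroughput 64      # raise or lower to balance background I/O (0 = unthrottled)
nodetool compact orders_ks orders        # force a major compaction; use sparingly
```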
4. Repair and Rebuild with Throttling
Limit repair scope with -pr (primary range only) and run repairs sequentially with -seq. For rebuilds, isolate streaming targets and avoid mixing with repair workloads.
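A minimal sketch, with a hypothetical keyspace and source datacenter:

```
nodetool repair -pr -seq orders_ks       # primary range only, one replica set at a time
nodetool setstreamthroughput 100         # cap streaming bandwidth (Mb/s) before a rebuild
nodetool rebuild -- dc1                  # stream data from the named source datacenter
```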
5. Eliminate Configuration Drift
Standardize deployment using automation tools like Ansible or Puppet. Enforce versioning for configuration files and validate JVM consistency across nodes.
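As a complement to the configuration-management tooling, a quick check (hypothetical host names) can confirm that every node reports the same JVM and Cassandra versions:

```
#!/usr/bin/env bash
# Verify JVM and Cassandra versions match across nodes.
NODES="cass-node1 cass-node2 cass-node3"

for host in $NODES; do
  java_ver=$(ssh "$host" 'java -version 2>&1 | head -n1')
  cass_ver=$(ssh "$host" 'nodetool version 2>/dev/null')
  printf '%-14s | %s | %s\n' "$host" "$java_ver" "$cass_ver"
done
```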
Best Practices for Cassandra Operations
- Run nodetool repair weekly or automate via cassandra-reaper.
- Distribute data evenly with proper partition keys and avoid hotspots.
- Use Prometheus exporters or DataStax metrics collector for visibility.
- Test compaction and upgrade strategies in staging before production rollout.
- Use LCS (Leveled Compaction Strategy) for read-heavy workloads with small rows.
Conclusion
Cassandra delivers high throughput and fault tolerance, but its distributed nature introduces operational complexity. Most production issues stem from unoptimized data models, misconfigured compaction, or failure to maintain cluster health through repair and monitoring. With consistent diagnostics using nodetool, metric instrumentation, and disciplined configuration management, teams can effectively troubleshoot and stabilize Cassandra deployments at scale.
FAQs
1. Why is Cassandra showing high read latency?
Check for tombstone-heavy queries, inconsistent replicas, or insufficient page size. Use tracing and nodetool metrics to isolate the cause.
2. What causes TombstoneOverwhelmingException?
Excessive tombstones in a partition. Avoid frequent deletes and use TTLs or partition pruning strategies to mitigate.
3. How do I safely run repairs?
Run nodetool repair --full -seq to keep the repair sequential. Use cassandra-reaper for automated, throttled repairs across nodes.
4. How can I detect configuration drift?
Hash key files like cassandra.yaml across nodes and compare. Use CM tools like Ansible to enforce uniform settings.
5. What’s the best compaction strategy?
Use LCS for read-heavy workloads, STCS for write-heavy workloads, and TWCS for time-series data. Choose based on access patterns and row size.
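As an illustration, switching a read-heavy table to LCS is a single schema change; the table name and SSTable size below are hypothetical, and the change triggers a burst of re-compaction, so test it in staging first:

```
cqlsh -e "
ALTER TABLE orders_ks.orders
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};
"
```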