Redis Architecture: A Brief Overview
Single Node vs. Sentinel vs. Cluster
- Single Node: Simple to deploy but offers no HA or partitioning.
- Redis Sentinel: Provides automatic failover and monitoring but not sharding.
- Redis Cluster: Adds data partitioning with multiple masters, but introduces complexity in slot management and consistency guarantees.
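To make the slot management mentioned above more concrete, here is a minimal sketch of how a key maps to one of the 16384 hash slots in Redis Cluster: a CRC16 of the key (or of its hash tag, if one is present) modulo 16384. The function name key_slot is illustrative only, and the sketch relies on Python's binascii.crc_hqx seeded with 0 producing the XMODEM CRC16 variant the cluster uses.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (CRC16 mod 16384)."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains a non-empty "{...}" section, only that
    # section is hashed, so related keys can be forced into the same slot.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC-CCITT (XMODEM).
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


if __name__ == "__main__":
    for k in ("foo", "{user1000}.following", "{user1000}.followers"):
        print(k, key_slot(k))
```

Keys that share a hash tag, such as the two {user1000} keys, land in the same slot, which is what makes multi-key operations on them possible in a cluster.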
Common Enterprise Deployment Patterns
- HA via Redis Sentinel with three or more sentinels
- Horizontal scale via Redis Cluster with hash slot distribution
- Ephemeral storage via Kubernetes StatefulSets or managed services (e.g., AWS ElastiCache)
Critical Symptoms and Root Causes
Symptom: Sudden Key Evictions or Data Loss
This typically results from maxmemory policies kicking in. If maxmemory-policy is set to volatile-lru or allkeys-lru and memory consumption hits the cap, Redis will start evicting keys, even critical ones, without clear warnings.
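The difference between the policy families can be sketched as a small model of which keys are eligible for eviction. This is only an illustration of eligibility, not of how Redis picks a victim among eligible keys, and the function name and input shape are assumptions made for the example.

```python
from typing import Dict, List, Optional


def eviction_candidates(policy: str, ttls: Dict[str, Optional[int]]) -> List[str]:
    """Return the keys a given maxmemory-policy is allowed to evict.

    `ttls` maps key name -> remaining TTL in seconds, or None if no TTL is set.
    """
    if policy == "noeviction":
        return []  # nothing is evicted; writes that need memory fail instead
    if policy.startswith("volatile-"):
        # volatile-* policies only consider keys that carry a TTL
        return [k for k, ttl in ttls.items() if ttl is not None]
    if policy.startswith("allkeys-"):
        # allkeys-* policies consider every key, TTL or not
        return list(ttls)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    keys = {"session:1": 300, "config:flags": None}
    for p in ("volatile-lru", "allkeys-lru", "noeviction"):
        print(p, eviction_candidates(p, keys))
```

In this model, a critical key without a TTL is only at risk under an allkeys-* policy, while under volatile-* the risk falls on anything that was given an expiry.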
Symptom: Failover Loops in Sentinel
Caused by network partitions or slow disk I/O on the master. Sentinels mark the master as unreachable and trigger a failover; when this repeats, the frequent topology changes lead to consistency issues, especially when clients stay connected to outdated endpoints.
Symptom: High CPU or Latency Spikes
Usually traced back to large keys, misuse of data types (e.g., massive lists), blocking commands like KEYS or FLUSHALL, or slow clients holding back replication or pub/sub propagation.
Diagnostics and Monitoring Techniques
Inspect Memory and Eviction Metrics
INFO memory
Review used_memory_peak and maxmemory in this output; evicted_keys is reported in the stats section (INFO stats). Use MEMORY STATS and MEMORY USAGE <key> to locate oversized entries.
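As a rough sketch of how these metrics could be checked automatically, the snippet below parses raw INFO text (the key:value lines that redis-cli prints) and flags memory pressure. The helper names and the 0.9 warning ratio are assumptions chosen for illustration; evicted_keys comes from the stats section, so the input is assumed to contain both sections.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse `key:value` lines of INFO output, skipping `# Section` headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_report(fields: Dict[str, str], warn_ratio: float = 0.9) -> Dict[str, object]:
    """Summarize memory pressure; warn_ratio is an arbitrary example threshold."""
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))  # 0 means no limit is configured
    evicted = int(fields.get("evicted_keys", "0"))
    ratio = used / limit if limit else None
    return {
        "used_ratio": ratio,
        "near_limit": ratio is not None and ratio >= warn_ratio,
        "evictions_seen": evicted > 0,
    }


SAMPLE = """# Memory
used_memory:950
used_memory_peak:1000
maxmemory:1000
# Stats
evicted_keys:12
"""

if __name__ == "__main__":
    print(memory_report(parse_info(SAMPLE)))
```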
Track Sentinel Behavior
redis-cli -p 26379 INFO Sentinel
Identify frequent failovers, quorum loss, or flapping leader election. Cross-reference with system logs and client connection retries.
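When reasoning about quorum loss, it can help to spell out the arithmetic. The sketch below assumes the usual Sentinel rule: the configured quorum decides when the master is considered down, but a failover also needs authorization from a majority of all known Sentinels, so the effective requirement is the larger of the two. The function names are illustrative.

```python
def votes_needed(total_sentinels: int, quorum: int) -> int:
    """Votes a Sentinel needs to actually carry out a failover."""
    majority = total_sentinels // 2 + 1
    # Quorum alone only flags the master as down; the failover itself
    # also requires a majority of all Sentinels.
    return max(quorum, majority)


def can_failover(total_sentinels: int, reachable_sentinels: int, quorum: int) -> bool:
    """True if enough Sentinels can still talk to each other to fail over."""
    return reachable_sentinels >= votes_needed(total_sentinels, quorum)


if __name__ == "__main__":
    # Five Sentinels with quorum 2, but only two can see each other:
    # the master can be flagged as down, yet no failover is authorized.
    print(can_failover(total_sentinels=5, reachable_sentinels=2, quorum=2))
```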
Analyze Latency Distribution
LATENCY LATEST
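Reveals the most recent latency spike recorded for each event. For historical profiling, use LATENCY HISTORY <event> (e.g., fork, command). Both rely on the latency monitor, which is only active when latency-monitor-threshold is set to a non-zero value in milliseconds.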
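LATENCY HISTORY returns a series of (timestamp, latency in milliseconds) pairs for an event. A minimal sketch of summarizing such a series is shown below; the function name, the sample values, and the 500 ms threshold are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def summarize_latency(samples: Iterable[Tuple[int, int]], threshold_ms: int) -> Dict[str, float]:
    """Summarize (unix_timestamp, latency_ms) pairs from a latency history."""
    latencies = [ms for _, ms in samples]
    if not latencies:
        return {"count": 0, "max_ms": 0, "avg_ms": 0.0, "over_threshold": 0}
    return {
        "count": len(latencies),
        "max_ms": max(latencies),
        "avg_ms": sum(latencies) / len(latencies),
        # how many recorded spikes exceeded the caller's own threshold
        "over_threshold": sum(1 for ms in latencies if ms > threshold_ms),
    }


if __name__ == "__main__":
    history = [(1405067976, 251), (1405067977, 1001), (1405067980, 120)]
    print(summarize_latency(history, threshold_ms=500))
```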
Review Keyspace Access Patterns
MONITOR
Useful for real-time debugging but expensive in production, because it streams every command the server processes. Use SLOWLOG GET to review only the commands that exceeded the slowlog-log-slower-than threshold.
Step-by-Step Remediation
1. Prevent Unintended Key Evictions
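- Set an appropriate maxmemory based on your instance size
- Choose safer eviction policies like noeviction in critical systems; with noeviction, writes that need more memory fail with an error instead of evicting data, so the application must handle that error
- Audit TTLs and large keys regularly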
2. Stabilize Sentinel-Based HA
- Ensure all Sentinels can reach each other over low-latency links
- Set down-after-milliseconds and failover-timeout to conservative values
- Use client libraries that support auto-discovery of new masters
3. Eliminate Performance Bottlenecks
- Avoid commands that block the server for long stretches, like KEYS * on a large keyspace or HGETALL on large hashes
- Use pipelining or batching to reduce round-trips
- Monitor and disconnect slow clients
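The pipelining advice above can be sketched as a helper that groups writes into fixed-size batches and sends each batch in one round-trip. The client argument is assumed to expose the redis-py style pipeline(transaction=False), set, and execute methods; the helper names and the default batch size of 500 are illustrative choices, not recommendations.

```python
from typing import Any, Iterable, Iterator, List, Tuple


def chunks(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def set_many(client: Any, items: Iterable[Tuple[str, Any]], batch_size: int = 500) -> int:
    """Write key/value pairs in pipelined batches; returns the number of replies."""
    written = 0
    for batch in chunks(items, batch_size):
        # transaction=False: plain pipelining, no MULTI/EXEC wrapper
        pipe = client.pipeline(transaction=False)
        for key, value in batch:
            pipe.set(key, value)
        written += len(pipe.execute())  # one round-trip per batch
    return written
```

Bounding the batch size keeps any single pipeline from building a very large reply buffer on the server, which matters for the slow-client concern in the same list.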
4. Repair Cluster Inconsistencies
Use:
redis-cli --cluster check <host>:<port>
and:
redis-cli --cluster fix <host>:<port>
Ensure slots are fully covered, and replicas are correctly assigned. Always back up before making changes.
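What "fully covered" means can be illustrated with a small check over the slot ranges owned by the masters: every slot from 0 to 16383 must fall inside some range. The sketch below takes inclusive (start, end) ranges, however they were collected, and returns the gaps; the function name and input shape are assumptions for the example, not the output format of redis-cli.

```python
from typing import Iterable, List, Tuple

HASH_SLOTS = 16384  # slots are numbered 0..16383


def uncovered_slots(ranges: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return inclusive slot ranges that no master covers."""
    gaps = []
    next_needed = 0  # lowest slot not yet known to be covered
    for start, end in sorted(ranges):
        if start > next_needed:
            gaps.append((next_needed, start - 1))
        next_needed = max(next_needed, end + 1)
    if next_needed < HASH_SLOTS:
        gaps.append((next_needed, HASH_SLOTS - 1))
    return gaps


if __name__ == "__main__":
    # Three masters, with the middle one missing:
    print(uncovered_slots([(0, 5460), (10923, 16383)]))
```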
Best Practices for Redis in Production
- Use RDB + AOF hybrid persistence to balance durability and recovery speed
- Enable active defragmentation to avoid memory fragmentation stalls
- Apply rate-limiting at the application layer to avoid thundering herds
- Implement circuit breakers in client libraries to fail gracefully
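- For clusters, use DNS-based discovery for seed nodes together with cluster-aware clients; note that Redis Cluster maps keys to 16384 fixed hash slots rather than using consistent hashing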
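As one possible shape for the circuit-breaker practice listed above, here is a minimal sketch that wraps any callable, such as a Redis read. The class name, the thresholds, and the injectable clock are all assumptions for illustration; production code would typically use an established resilience library instead.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would catch CircuitOpenError and serve a fallback, for example a default value or a direct database read, instead of queuing more requests against an unhealthy Redis node.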
Conclusion
Redis is deceptively simple but operationally intricate when used at scale or under high availability configurations. Common failures—from unexpected key evictions to chaotic failovers—often stem from incorrect assumptions about Redis behavior under pressure. Through robust monitoring, well-tuned configurations, and awareness of architectural nuances, Redis can reliably serve latency-sensitive, mission-critical applications. Senior engineers must treat Redis as a distributed system component, not just a cache, and architect accordingly.
FAQs
1. Why do Redis keys disappear even without TTLs?
If maxmemory-policy is set to one of the allkeys-* policies and memory pressure builds, Redis will evict keys regardless of TTL settings. The volatile-* policies only evict keys that have a TTL, and noeviction evicts nothing.
2. How do I prevent Sentinel from flapping?
Ensure stable network links between Sentinel nodes and set conservative down-after-milliseconds. Also, monitor disk I/O latency, which affects master heartbeat visibility.
3. Can I use Redis MONITOR safely in production?
It is highly discouraged in production because it streams every command the server processes, which can noticeably reduce throughput. Instead, use SLOWLOG or dedicated tracing agents for command profiling.
4. What's the best persistence strategy for Redis?
Combining RDB snapshots with AOF provides fast recovery and durability. For critical systems, use AOF with appendfsync always, accepting its write-latency cost, and replicate data to secondary nodes.
5. How do I debug Redis Cluster slot migration issues?
Use redis-cli --cluster check and --cluster fix to ensure full slot coverage and no orphaned keys. Monitor rebalancing operations to avoid data loss during migrations.