Understanding Apache Hadoop Architecture
HDFS and YARN Overview
HDFS (Hadoop Distributed File System) stores data in blocks across DataNodes, managed by a central NameNode. YARN (Yet Another Resource Negotiator) handles cluster resource management and job scheduling. Failures in either layer can cascade across the system.
MapReduce and Resource Allocation
MapReduce jobs are executed in containers managed by NodeManagers under the supervision of the ResourceManager. Misconfigurations in memory, CPU, or speculative execution settings can lead to job slowdowns or repeated failures.
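The memory and CPU knobs sit at three levels: what each NodeManager offers, what the scheduler allows per container, and what each task requests. A minimal sketch of the node- and scheduler-level yarn-site.xml properties, shown in property=value form with placeholder values rather than recommendations:
# yarn-site.xml: resources each NodeManager advertises to YARN (placeholder values)
yarn.nodemanager.resource.memory-mb=57344
yarn.nodemanager.resource.cpu-vcores=14
# yarn-site.xml: per-container bounds enforced by the scheduler
yarn.scheduler.minimum-allocation-mb=1024
yarn.scheduler.maximum-allocation-mb=8192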
Common Apache Hadoop Issues in Production
1. NameNode High Availability and Failover Issues
If HA is configured improperly, the standby NameNode may fail to sync edit logs, so a failover can leave stale metadata or lose running jobs.
ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Edit log corruption detected
2. MapReduce Job Failures or Timeouts
Jobs fail due to memory overflow, insufficient vCores, missing input splits, or incorrect permissions on HDFS paths.
3. HDFS DataNode Disk Failures
Disk errors or space exhaustion cause DataNode shutdowns, block replication alerts, or corrupt block errors in logs.
4. YARN Resource Starvation
Improper queue configuration, default container sizes, or node-level resource exhaustion can prevent new jobs from being scheduled.
5. Cluster Slowness Under Load
Job stragglers, inefficient joins, or I/O bottlenecks due to network saturation and small file problems can degrade overall throughput.
Diagnostics and Debugging Techniques
Analyze Job History and Logs
Access the ResourceManager UI or JobHistoryServer to inspect failed task logs, stderr/stdout output, and attempt counts.
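When UI access is limited, the same information can be pulled from the command line, assuming log aggregation is enabled; the application ID below is a placeholder taken from the ResourceManager UI:
# List recently failed or killed applications
yarn application -list -appStates FAILED,KILLED
# Fetch aggregated container logs for one application (placeholder ID)
yarn logs -applicationId application_1700000000000_0001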
Monitor HDFS Health
Use hdfs dfsadmin -report and hdfs fsck to validate block replication, find under-replicated files, and identify dead DataNodes.
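A typical health check pass might look like the following; both commands are read-only:
# Capacity, live/dead DataNodes, and under-replicated block counts
hdfs dfsadmin -report
# Walk the namespace and report missing, corrupt, or under-replicated blocks
hdfs fsck / -files -blocks -locations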
Track Resource Usage via Ambari or Cloudera Manager
Visualize node-level memory, CPU, disk, and network utilization. Investigate saturation patterns and container preemption rates.
Enable Debug-Level Logging
Modify log4j settings to increase log verbosity for specific daemons. Restart affected services after configuration changes.
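For a temporary change that avoids a restart, log levels can also be adjusted on a running daemon with hadoop daemonlog; a sketch, where the NameNode host and HTTP port 9870 are assumptions that depend on your deployment and Hadoop version:
# Raise a specific NameNode logger to DEBUG at runtime (reverts on restart)
hadoop daemonlog -setlevel nn-host:9870 org.apache.hadoop.hdfs.server.namenode.FSNamesystem DEBUG
# Confirm the current level
hadoop daemonlog -getlevel nn-host:9870 org.apache.hadoop.hdfs.server.namenode.FSNamesystem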
Step-by-Step Resolution Guide
1. Fix NameNode Failover and EditLog Sync
Ensure the shared edits directory is mounted and accessible, then run:
hdfs namenode -bootstrapStandby
Validate journal nodes and check ZKFC logs for failover attempts.
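To confirm HA state and exercise failover, the haadmin commands below are a reasonable check; nn1 and nn2 are placeholder NameNode IDs as defined in hdfs-site.xml:
# Report which NameNode is active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Trigger a graceful failover to verify ZKFC and the JournalNodes are healthy
hdfs haadmin -failover nn1 nn2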
2. Resolve MapReduce Job Failures
Check the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb settings. Review application logs for Java heap errors or permission denials.
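As a rough illustration, container memory and JVM heap are usually sized together, with the heap around 75-80% of the container; the values below are assumptions to adapt, not recommendations:
# mapred-site.xml (illustrative values only)
mapreduce.map.memory.mb=2048
mapreduce.map.java.opts=-Xmx1638m
mapreduce.reduce.memory.mb=4096
mapreduce.reduce.java.opts=-Xmx3276m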
3. Recover from DataNode Disk Failures
Replace faulty disks, run hdfs datanode -rollback if needed, and rebalance blocks with:
hdfs balancer -threshold 10
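If a single bad disk should not take the whole DataNode offline, hdfs-site.xml can tolerate a bounded number of failed volumes; a sketch, where the value 1 is an assumption to size against your disk count and risk tolerance:
# hdfs-site.xml: keep the DataNode running after one failed volume (illustrative)
dfs.datanode.failed.volumes.tolerated=1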
4. Adjust YARN Resource Allocation
Update yarn.scheduler.capacity.root.default.maximum-allocation-mb and the matching maximum-allocation-vcores setting to match cluster size. Use preemption policies to ensure fair sharing.
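A sketch of per-queue limits in capacity-scheduler.xml, shown in property=value form with placeholder numbers; the preemption monitor itself is enabled in yarn-site.xml:
# capacity-scheduler.xml (placeholder values)
yarn.scheduler.capacity.root.default.capacity=60
yarn.scheduler.capacity.root.default.maximum-allocation-mb=8192
yarn.scheduler.capacity.root.default.maximum-allocation-vcores=4
# yarn-site.xml: enable the preemption monitor so idle capacity can be reclaimed
yarn.resourcemanager.scheduler.monitor.enable=true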
5. Optimize for Performance and Scalability
Use CombineFileInputFormat to mitigate small file issues. Tune speculative execution via:
mapreduce.map.speculative=false
mapreduce.reduce.speculative=false
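These can also be set per job instead of cluster-wide; for example, assuming the job driver uses ToolRunner so generic -D options are honored (the jar, class, and paths are placeholders):
# Disable speculative execution for a single run (placeholders throughout)
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.map.speculative=false \
  -D mapreduce.reduce.speculative=false \
  /input /output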
Best Practices for Reliable Hadoop Operations
- Use NameNode HA with Quorum Journal Nodes for failover resilience.
- Consolidate small files using SequenceFile or HBase.
- Monitor disk space thresholds on all DataNodes and preemptively decommission failing nodes.
- Set container memory and vCore limits according to workload characteristics.
- Use capacity scheduler with defined queues to avoid resource contention.
Conclusion
Apache Hadoop provides scalable data processing capabilities but requires careful monitoring and tuning to ensure reliability. From NameNode availability to job execution tuning, addressing Hadoop issues demands visibility across HDFS, YARN, and MapReduce layers. By applying best practices in configuration, logging, and workload planning, DevOps and data engineering teams can maintain high-performing Hadoop clusters that support enterprise-scale analytics.
FAQs
1. Why is my Hadoop job stuck in the ACCEPTED state?
This usually indicates resource contention or improper queue configuration. Check the YARN UI for pending resources and queue capacity limits.
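A quick way to confirm this from the command line (the queue name default is an assumption):
# Applications still waiting on resources
yarn application -list -appStates ACCEPTED
# Configured capacity, current usage, and state of the target queue
yarn queue -status default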
2. How can I fix under-replicated HDFS blocks?
Run hdfs fsck / -blocks -locations and rebalance the cluster. Add DataNodes if under-replication persists.
3. What causes frequent NameNode failovers?
Common causes include network partitioning, ZKFC misconfiguration, or slow JournalNode syncing. Review ZooKeeper logs and HA health checks.
4. Why do my MapReduce tasks fail with OutOfMemory?
Task memory allocation is too low. Increase mapreduce.{map|reduce}.memory.mb and the corresponding mapreduce.{map|reduce}.java.opts values accordingly.
5. Can I upgrade Hadoop without downtime?
Rolling upgrades are possible with HA configured, but require version compatibility and cluster quiescence during transition.