Apache Hadoop: Architectural Overview
Core Components
- HDFS (Hadoop Distributed File System): Stores data across multiple nodes with fault tolerance via replication.
- YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules jobs.
- MapReduce: Framework for distributed data processing.
- Hive, Pig, HBase, and Spark: Common tools running atop Hadoop infrastructure.
Common Troubleshooting Scenarios
1. Slow or Stuck MapReduce Jobs
Jobs may hang or run slowly due to data skew, poorly tuned mappers/reducers, unbalanced partitioning, or resource contention in YARN.
yarn logs -applicationId application_1683xxxx_0004
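Beyond the aggregated logs, a quick way to spot a straggler is to list the attempts still running; the job ID below is a placeholder, and the -list-attempt-ids arguments follow the standard mapred CLI order (task type, then task state):
mapred job -status job_1683xxxx_0004                           # overall progress and job-level counters
mapred job -list-attempt-ids job_1683xxxx_0004 REDUCE running  # reduce attempts still running; one lone straggler usually means skew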
2. NameNode Memory Pressure
The HDFS NameNode stores file system metadata in memory. Large numbers of small files can overwhelm the heap, leading to out-of-memory errors or long GC pauses.
jstat -gcutil NAMENODE_PID 1s
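If small files are suspected, a rough per-directory file count is often enough to confirm it; the /data path below is hypothetical:
hdfs dfs -count /data          # columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
hdfs dfs -count -q -h /data    # same view with quota information and human-readable sizes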
3. DataNode Disk Failures
Failed disks can trigger block replication storms and reduce available HDFS throughput. Check DataNode logs and hardware monitoring tools.
hdfs dfsadmin -report
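As a complementary check, fsck summarizes missing, corrupt, and under-replicated blocks; on large clusters, scope it to a subtree rather than / to keep the scan cheap:
hdfs fsck / -list-corruptfileblocks
hdfs fsck / | grep -iE 'under.?replicated|missing'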
4. YARN Resource Starvation
Improper configuration of yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb can cause jobs to be stuck in the ACCEPTED state indefinitely.
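A minimal check, assuming your configuration lives under $HADOOP_CONF_DIR: confirm the allocation bounds, then see which applications are waiting in ACCEPTED.
grep -A1 -E 'yarn.scheduler.(minimum|maximum)-allocation-mb' "$HADOOP_CONF_DIR/yarn-site.xml"
yarn application -list -appStates ACCEPTED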
5. Job History Server Inconsistencies
Missing job history files or slow UI loading can stem from misconfigured log aggregation or exhausted local storage on HistoryServer nodes.
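A few quick checks, assuming default paths (the remote log directory defaults to /tmp/logs unless yarn.nodemanager.remote-app-log-dir says otherwise):
jps | grep JobHistoryServer                                              # is the HistoryServer process up?
grep -A1 'yarn.log-aggregation-enable' "$HADOOP_CONF_DIR/yarn-site.xml"  # is log aggregation enabled?
hdfs dfs -du -h /tmp/logs                                                # how much aggregated log data has accumulated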
Diagnostics: Where to Begin
Job-Level Debugging
Start with the ResourceManager UI or CLI to check job status, then track individual attempts through the ApplicationMaster logs.
mapred job -status job_1683xxxx_0004
yarn application -status application_1683xxxx_0004
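On Hadoop 2.8 or later, yarn logs can also pull the ApplicationMaster container logs directly, which is usually the fastest way to see why attempts fail; the application ID is a placeholder:
yarn logs -applicationId application_1683xxxx_0004 -am 1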
Node Health Checks
Use:
yarn node -list
hdfs dfsadmin -report
jps
to detect dead nodes, unresponsive NodeManagers, or stopped services.
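If a node looks suspect, drill into it; the node ID (host:port) comes from the -list output, so the value below is only a placeholder:
yarn node -list -all          # includes UNHEALTHY and LOST nodes, not just RUNNING ones
yarn node -status <node_id>   # health report, container count, and resource usage for one node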
GC and Heap Analysis
Run jmap, jstat, or Java Flight Recorder (JFR) to identify long GC pauses or memory leaks in the NameNode and ResourceManager JVMs.
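For example, using NAMENODE_PID as in the earlier jstat command (note that jmap -histo:live itself forces a full GC, so use it sparingly on a live NameNode):
jstat -gcutil NAMENODE_PID 5s 12              # GC utilization sampled every 5 seconds, 12 samples
jmap -histo:live NAMENODE_PID | head -n 20    # top heap consumers by class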
Data Skew Detection
Use job counters and input split logs to locate mappers or reducers processing disproportionate amounts of data.
mapred job -counter job_id org.apache.hadoop.mapreduce.TaskCounter REDUCE_INPUT_GROUPS
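Counters retrieved this way are job-level aggregates; to see how work is spread across individual tasks, pull per-task statistics from the job history as well (the -history form that accepts a job ID requires a reasonably recent Hadoop release):
mapred job -counter job_1683xxxx_0004 org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS
mapred job -history job_1683xxxx_0004 | less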
Best Practices to Prevent Failures
- Avoid millions of small files; consolidate data into SequenceFiles or ORC files
- Tune NameNode and ResourceManager heaps (-Xms/-Xmx, GC algorithm)
- Implement rack awareness to optimize HDFS replication
- Enable log aggregation and monitor HDFS audit logs
- Use the YARN CapacityScheduler with queues to avoid resource hogging (a quick queue check is sketched below)
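As a quick sanity check on scheduler configuration, queue state, capacity, and utilization can be inspected from the CLI; the queue name "default" is only an example:
yarn queue -status default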
Conclusion
Apache Hadoop remains a backbone of large-scale data processing, but its complexity demands proactive diagnostics and informed configurations. By understanding failure domains across HDFS, YARN, and MapReduce layers—and by using logs, job counters, and JVM metrics—senior teams can prevent data loss, reduce job latency, and improve resource utilization. Building robust data pipelines means treating Hadoop not just as infrastructure, but as a living system requiring continuous tuning and observability.
FAQs
1. How can I prevent small file overload in HDFS?
Use Hive tables backed by ORC, SequenceFiles, or batch ingestion pipelines to consolidate small records into larger files before writing to HDFS.
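Hadoop Archives (HAR) are another option for data already sitting on HDFS; this is a sketch with hypothetical paths (parent /raw, source directory logs, destination /archive):
hadoop archive -archiveName logs-2023.har -p /raw logs /archive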
2. Why do some YARN jobs stay in ACCEPTED state?
Usually due to insufficient free resources, misconfigured memory allocations, or job queue contention in the YARN scheduler.
3. How do I analyze a slow-running reducer?
Check task logs and counters for skew, spilled records, or long sort phases. Hadoop counters like REDUCE_SHUFFLE_BYTES are especially useful.
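For instance (job ID is a placeholder; the counter lives in the standard TaskCounter group used earlier):
mapred job -counter job_1683xxxx_0004 org.apache.hadoop.mapreduce.TaskCounter REDUCE_SHUFFLE_BYTES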
4. What causes NameNode GC pauses?
Large metadata loads or heap misconfiguration can cause long full-GC pauses. Use G1GC (or, on older JVMs, a tuned CMS) to reduce stop-the-world times.
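A sketch of the corresponding tuning in hadoop-env.sh; the heap sizes and pause target are illustrative, not recommendations (on Hadoop 2.x the variable is HADOOP_NAMENODE_OPTS, on 3.x it is HDFS_NAMENODE_OPTS):
export HDFS_NAMENODE_OPTS="-Xms16g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 ${HDFS_NAMENODE_OPTS}"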
5. Can I run Hadoop in a cloud-native environment?
Yes, though it requires architectural changes. Managed services such as Amazon EMR and Google Cloud Dataproc, or Kubernetes operators (e.g., Hadoop Operator), can help manage Hadoop workloads in cloud-native stacks.