Hadoop & Spark YARN Setting Recommendations


Recommended YARN settings for Hadoop & Spark

yarn.nodemanager.resource.cpu-vcores=12
yarn.nodemanager.resource.memory-mb=84000
yarn.nodemanager.resource.percentage-physical-cpu-limit=100
yarn.nodemanager.vmem-pmem-ratio=5
yarn.scheduler.maximum-allocation-mb=8192
Adjust this setting based on the most memory-intensive workload(s) anticipated on the cluster; it should be a value less than or equal to yarn.nodemanager.resource.memory-mb.
yarn.scheduler.minimum-allocation-mb=2048

This can theoretically be lower if the running jobs use less JVM memory.
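
As a minimal sketch, the node manager and scheduler properties above belong in yarn-site.xml (inside the top-level <configuration> element); the values below simply mirror the recommendations above and should be adjusted to match your hardware:

<property>
 <name>yarn.nodemanager.resource.cpu-vcores</name>
 <value>12</value>
</property>
<property>
 <name>yarn.nodemanager.resource.memory-mb</name>
 <value>84000</value>
</property>
<property>
 <name>yarn.scheduler.minimum-allocation-mb</name>
 <value>2048</value>
</property>
<property>
 <name>yarn.scheduler.maximum-allocation-mb</name>
 <value>8192</value>
</property>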

mapreduce.map.java.opts=-Xmx3072m
mapreduce.map.memory.mb=4096
mapreduce.reduce.java.opts=-Xmx6144m
mapreduce.reduce.memory.mb=8192

The values above assume memory-intensive workloads. With them, the maximum number of concurrent map tasks on a compute node is 84000/4096 = 20 (rounded down), and the maximum number of concurrent reduce tasks is 84000/8192 = 10 (rounded down).

yarn.app.mapreduce.am.command-opts=-Xmx1638m
yarn.app.mapreduce.am.resource.mb=2048
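
For reference, the MapReduce task and ApplicationMaster memory settings above are normally placed in mapred-site.xml. A minimal sketch using the same values:

<property>
 <name>mapreduce.map.memory.mb</name>
 <value>4096</value>
</property>
<property>
 <name>mapreduce.map.java.opts</name>
 <value>-Xmx3072m</value>
</property>
<property>
 <name>mapreduce.reduce.memory.mb</name>
 <value>8192</value>
</property>
<property>
 <name>mapreduce.reduce.java.opts</name>
 <value>-Xmx6144m</value>
</property>
<property>
 <name>yarn.app.mapreduce.am.resource.mb</name>
 <value>2048</value>
</property>
<property>
 <name>yarn.app.mapreduce.am.command-opts</name>
 <value>-Xmx1638m</value>
</property>

Note that each -Xmx value is kept below its container size (roughly 75-80% of it) so that the JVM heap plus non-heap overhead fits within the container's memory limit.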


HDFS data directories and YARN local and log directories

It is recommended to spread these directories over multiple disks / mount points, for example as shown in the sketch after the yarn-site.xml snippet below.

yarn-site.xml

<property>
 <name>yarn.nodemanager.local-dirs</name>
 <value>/rdm/hadoop/yarn/local</value>
</property>
<property>
 <name>yarn.nodemanager.log-dirs</name>
 <value>/var/hadoop/yarn/log</value>
</property>

YARN uses local storage for container working directories as well as for logs. Testing shows that spreading these directories across several disks can increase throughput.
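
Both yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs accept comma-separated lists of directories, so a sketch using hypothetical mount points /data1, /data2 and /data3 (replace with your own) could look like:

<property>
 <name>yarn.nodemanager.local-dirs</name>
 <value>/data1/hadoop/yarn/local,/data2/hadoop/yarn/local,/data3/hadoop/yarn/local</value>
</property>
<property>
 <name>yarn.nodemanager.log-dirs</name>
 <value>/data1/hadoop/yarn/log,/data2/hadoop/yarn/log,/data3/hadoop/yarn/log</value>
</property>

The analogous HDFS property, dfs.datanode.data.dir in hdfs-site.xml, likewise takes a comma-separated list, so DataNode storage can be spread over the same set of disks.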
