Using Yarn to Schedule Jobs & Manage Resources (CPU & Memory) in a Hadoop / Spark Cluster


Yarn is a major resource-management improvement for Hadoop MapReduce and Spark jobs. Once it is understood and configured for a given workload and cluster size, its benefits become obvious. The Yarn architecture is powerful and deliberately extensible, which adds some complexity in order to accommodate future additions. Like most such enhancements, it comes with a large collection of parameters to learn and tune for the workload and cluster. This article attempts to demystify Yarn so a Hadoop / Spark cluster can harness its full power and potential.

INTRODUCTION
Applications that execute on a Hadoop / Spark cluster can be scheduled and executed using Yarn. The CPU and memory resources provided by the servers in a cluster can be managed and allocated by Yarn. Configured correctly, Yarn helps ensure that the limited resources within a cluster (CPU and memory) are not overcommitted, allowing jobs to execute without resource exhaustion and with limited contention. This can reduce job durations and allow more jobs to be completed.

Note: Yarn is currently able to manage only CPU and memory resources within a cluster. Yarn may one day also manage network and I/O bandwidth, which would allow it to evaluate every significant resource within a cluster before dispatching work to a node.

To avoid exceeding the CPU and memory resources within a cluster, Yarn limits the amount of work allowed to execute on the cluster and on each node. When a new application (also called a job, task, or unit of work) is submitted, Yarn checks the available nodes for the requested amounts of CPU and memory. If the new request plus the jobs already executing would exceed the defined limits, the new request is postponed or fails.

Yarn also attempts to place new work on the best possible node. To do this, it looks for a healthy node with available resources and, ideally, local access to the desired data. Once a node is selected, Yarn allocates CPU and memory on that node before starting the application. If, during this process (container startup on the Node Manager), the necessary CPU and memory are no longer available, Yarn postpones or fails the job to avoid exceeding the limits.
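As a hypothetical illustration of this admission check: if a node advertises 10240 MB of memory to Yarn and is already running containers totaling 9216 MB, a new request for a 1536 MB container cannot be satisfied there; the scheduler must wait for a running container to finish, place the new container on another node, or eventually postpone or fail the request.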

 

SCHEDULING AN APPLICATION FOR EXECUTION

Yarn’s application scheduler is pluggable, meaning new types of schedulers can be added to Yarn. Currently, Yarn provides two schedulers – the CapacityScheduler and the FairScheduler.

Note: We will address FairScheduler in a future update.

CAPACITY SCHEDULER

The CapacityScheduler allocates and executes work on a node in the cluster based on the node's ability to execute the workload. It decides whether a node has the necessary resources by consulting the Resource Calculator. The Yarn Resource Calculator is also a pluggable layer that can be extended. Currently, the CapacityScheduler ships with two Resource Calculators – the DefaultResourceCalculator and the DominantResourceCalculator.

DEFAULT RESOURCE CALCULATOR
The DefaultResourceCalculator examines only the memory required by a request and the memory available on the node. CPU requirements and the CPU available on the node are ignored.

DOMINANT RESOURCE CALCULATOR
The DominantResourceCalculator takes both available CPU and memory resources into account. To decide which resource takes priority (CPU or memory), it uses the Dominant Resource Fairness (DRF) policy.
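As a hypothetical illustration of DRF: in a cluster offering 100 vcores and 1000 GB of memory, an application requesting 2 vcores and 50 GB per container consumes 2% of the cluster's CPU but 5% of its memory, so memory is that application's dominant resource. DRF compares applications by these dominant shares, so a memory-heavy job and a CPU-heavy job are balanced against each other fairly when competing for containers.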

MEMORY RESOURCES

Node Manager Configuration (yarn-site.xml)
Yarn physical memory limit per node in the cluster (the total for all containers on the node):
yarn.nodemanager.resource.memory-mb=10240 (default)

Yarn minimum memory allocation per container:
yarn.scheduler.minimum-allocation-mb=1024 (default)

Yarn virtual memory limit per container: 2.1 times the container's physical memory (yarn.nodemanager.vmem-pmem-ratio, default 2.1).
For example, at the default ratio a 1024 MB map-task container may use up to 2150.4 MB of virtual memory,
and a 1536 MB reduce-task container up to 3225.6 MB.
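The same node-level memory settings expressed as a yarn-site.xml sketch (the values shown are the illustrative figures above, not recommendations; adjust per node):

<!-- yarn-site.xml: node-level memory settings (illustrative values) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>10240</value>   <!-- physical memory available to all containers on this node -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>    <!-- smallest container the scheduler will allocate -->
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>     <!-- virtual memory allowed per unit of physical container memory -->
</property>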

Application (job, work, client) JVM sizes
Mapper heap size: mapreduce.map.java.opts=-Xmx1G (default)
Reducer heap size: mapreduce.reduce.java.opts=-Xmx1G (default)

Yarn Container sizes
Mapper container size: mapreduce.map.memory.mb=1536
(default: the mapreduce.map.java.opts heap plus overhead, usually 512 MB)

Reducer container size: mapreduce.reduce.memory.mb=1536
(default: the mapreduce.reduce.java.opts heap plus overhead, usually 512 MB)
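Putting the application-side settings together, here is a mapred-site.xml sketch (illustrative values; the heap in java.opts is kept roughly 512 MB below the container size so the overhead fits):

<!-- mapred-site.xml: per-job memory settings (illustrative values) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>        <!-- map container size requested from Yarn -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>   <!-- map JVM heap; leaves ~512 MB for overhead -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1536</value>        <!-- reduce container size requested from Yarn -->
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m</value>   <!-- reduce JVM heap; leaves ~512 MB for overhead -->
</property>

With yarn.nodemanager.resource.memory-mb=10240 and 1536 MB containers, a node can run at most 10240 / 1536 ≈ 6 containers concurrently; additional requests wait until a container completes.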

CPU RESOURCES

CPU resources are managed by Yarn only after the following configuration properties are enabled.

In capacity-scheduler.xml, set the CapacityScheduler's resource calculator:

yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
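The same setting written as a capacity-scheduler.xml property element:

<!-- capacity-scheduler.xml: consider CPU as well as memory when placing containers -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>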

Next, adjust yarn.nodemanager.resource.cpu-vcores in yarn-site.xml on each node in the cluster.

Intel Servers – For Intel servers using CPUs with hyperthreading (enabled in the BIOS), start with vcores set to (number of physical cores) + (number of physical cores × 0.30). Execute the planned workload on the cluster and evaluate peak CPU usage. Reduce vcores if the physical CPU cores are over 80% busy and the run queue is growing. Increase vcores if the cores are less than 80% busy and the run queue is near empty. The vcores value on each node should be re-adjusted whenever the server hardware or workload changes.
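For example, on hypothetical hardware with 16 physical cores and hyperthreading enabled, the starting point would be 16 + (16 × 0.30) ≈ 21 vcores. A yarn-site.xml sketch for that node:

<!-- yarn-site.xml: vcores this Node Manager advertises to the Resource Manager -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>21</value>   <!-- starting point for 16 physical cores with hyperthreading; tune per workload -->
</property>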

When the Node Managers start, each reports its vcores value to the Resource Manager, which uses those values to calculate the CPU resources available within the cluster. Applications also specify a desired number of vcores as part of their allocation requests, informing the Resource Manager of their CPU requirements.

Each Application provides a desired number of vcores to Yarn using:

mapreduce.map.cpu.vcores: (default 1) sets the desired number of vcores for each map task.

mapreduce.reduce.cpu.vcores: (default 1) sets the desired number of vcores for each reduce task.

yarn.app.mapreduce.am.resource.cpu-vcores: (default 1) sets the number of vcores used by the MapReduce Application Master process.
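A mapred-site.xml sketch of these per-job CPU requests (illustrative values; the defaults are usually sufficient unless tasks are multi-threaded):

<!-- mapred-site.xml: vcores requested per task (illustrative values) -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>   <!-- vcores requested for each map task -->
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>   <!-- vcores requested for each (multi-threaded) reduce task -->
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
  <value>1</value>   <!-- vcores for the MapReduce Application Master container -->
</property>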