Recommended tuning for DFSIO / Teragen / Terasort


Useful Linux command to gather detailed hardware information

# lshw

Note: Red Hat does not install lshw by default. Use the following command to install it:

# yum install lshw


NOTE: Parameters that MUST be adjusted for the environment are enclosed in [ ]

BIOS
Enable MAX PERFORMANCE, which should disable power saving for the CPU, QPI and memory
Note: For an E5-2600 v3 series processor, QPI should run at 9.6 GT/s and memory at 2133 MHz
Enable Hyper-Threading

Kernel 

sysctl -w vm.swappiness=5
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_writeback_centisecs=500
sysctl -w vm.dirty_expire_centisecs=100
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
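
To confirm that a setting like the one above took effect, it helps to know that sysfs selector files list every choice and mark the active one in square brackets, for example "always madvise [never]". The following is a minimal sketch; active_choice is a hypothetical helper name, and the path in the comment assumes a RHEL 6 style kernel (upstream kernels use /sys/kernel/mm/transparent_hugepage/enabled instead).

```shell
#!/bin/sh
# active_choice is a hypothetical helper, not a standard tool.
# It prints the bracketed (currently active) entry from a sysfs
# selector line such as "always madvise [never]".
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example with a literal string. On a real host, pass the file content:
#   active_choice "$(cat /sys/kernel/mm/redhat_transparent_hugepage/enabled)"
active_choice "always madvise [never]"
```

The same helper works for /sys/block/[sd?]/queue/scheduler, which uses the same bracket notation.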

TCP/IP

ifconfig [eth?] mtu 9000 up
ifconfig [eth?] txqueuelen 100000
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_dsack=0
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.core.optmem_max=16777216
sysctl -w net.ipv4.tcp_keepalive_probes=5
sysctl -w net.ipv4.tcp_keepalive_intvl=15
sysctl -w net.ipv4.tcp_retries2=2
sysctl -w net.ipv4.tcp_orphan_retries=1
sysctl -w net.ipv4.tcp_reordering=5
sysctl -w net.ipv4.tcp_retrans_collapse=0
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
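
Values set with sysctl -w are lost at reboot. One way to keep them, sketched here under assumptions, is to convert the commands above into sysctl.conf syntax. to_sysctl_conf is a hypothetical helper name, and the target file mentioned afterwards is only an example.

```shell
#!/bin/sh
# to_sysctl_conf is a hypothetical helper: it reads "sysctl -w key=value"
# lines on stdin and prints "key = value" lines in sysctl.conf syntax.
# Lines that do not start with "sysctl -w" are ignored, and the double
# quotes around multi-value settings are dropped.
to_sysctl_conf() {
    sed -n 's/^sysctl -w \([^=]*\)=\(.*\)$/\1 = \2/p' | tr -d '"'
}

# Example with two of the settings listed above
to_sysctl_conf <<'EOF'
sysctl -w net.core.rmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
EOF
```

The output can be appended to /etc/sysctl.conf, or written to a separate file (for example a file of your choosing under /etc/sysctl.d on systems that read that directory) and loaded with sysctl -p [file].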

IO Read-ahead

blockdev --setra 8192 [/dev/sd?]

IO Elevator

echo cfq > /sys/block/[sd?]/queue/scheduler
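
The read-ahead and elevator settings have to be repeated for every data disk. A minimal sketch of a loop follows; tune_disks is a hypothetical helper, sdb and sdc are placeholder device names, and the sketch assumes the elevator is cfq (the Completely Fair Queuing I/O scheduler). Note that blockdev --setra counts 512-byte sectors, so 8192 corresponds to 4 MiB of read-ahead.

```shell
#!/bin/sh
# tune_disks is a hypothetical helper. With DRY_RUN=1 (the default here)
# it only prints the commands; set DRY_RUN=0 and run as root to apply them.
# Arguments: block device names without the /dev/ prefix, e.g. sdb sdc.
DRY_RUN="${DRY_RUN:-1}"

tune_disks() {
    for dev in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "blockdev --setra 8192 /dev/$dev"
            echo "echo cfq > /sys/block/$dev/queue/scheduler"
        else
            blockdev --setra 8192 "/dev/$dev"
            echo cfq > "/sys/block/$dev/queue/scheduler"
        fi
    done
}

# Placeholder device names; replace with the data disks of the node
tune_disks sdb sdc
```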

File systems
EXT3/4

tune2fs -O ^has_journal,dir_index,extent -o journal_data_writeback,nobarrier,noatime,nodiratime,nobh,nouser_xattr,data=writeback,commit=100 [/dev/sd?]

Hadoop Scheduler

Configuring Fair Scheduler within yarn-site.xml

<property>
 <name>yarn.resourcemanager.scheduler.class</name>
 <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

When using Ambari, do not edit the .xml configuration files directly.

Restarting a service with Ambari causes the .xml configuration file to be rewritten from the settings in the Ambari database.
To change YARN properties, go to YARN->Configs->Advanced->Scheduler in the Ambari settings.

Cleaning an HDFS directory prior to a Teragen / Terasort run

/usr/bin/hdfs dfs -rm -r -skipTrash [/user/ambari-qa/in-dir]
/usr/bin/hdfs dfs -rm -r -skipTrash [/user/ambari-qa/out-dir]
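
Because -skipTrash deletes data permanently, a small guard around the command can be worthwhile. The sketch below is one possible approach, not part of Hadoop: clean_hdfs_dir and path_depth are hypothetical helpers, and the minimum depth of three path components is an arbitrary safety threshold.

```shell
#!/bin/sh
# clean_hdfs_dir is a hypothetical wrapper around the rm command above.
# It refuses paths with fewer than three components (for example "/" or
# "/user") and skips directories that do not exist.
# HDFS can be overridden; it defaults to the binary used in this document.
HDFS="${HDFS:-/usr/bin/hdfs}"

path_depth() {
    # Count non-empty path components: /user/ambari-qa/in-dir -> 3
    printf '%s\n' "$1" | tr '/' '\n' | grep -c .
}

clean_hdfs_dir() {
    if [ "$(path_depth "$1")" -lt 3 ]; then
        echo "refusing to remove shallow path: $1" >&2
        return 1
    fi
    # hdfs dfs -test -d returns 0 only if the path is an existing directory
    if "$HDFS" dfs -test -d "$1"; then
        "$HDFS" dfs -rm -r -skipTrash "$1"
    fi
}

# Usage (requires a working HDFS client):
#   clean_hdfs_dir /user/ambari-qa/in-dir
#   clean_hdfs_dir /user/ambari-qa/out-dir
```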

Teragen (1 TB equals 10 billion 100-byte rows)

/usr/bin/yarn jar /usr/iop/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.maps=144 -Ddfs.blocksize=536870912 -Dmapreduce.map.log.level=WARN -Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec -Dmapreduce.map.output.compress=true -Ddfs.replication=3 10000000000 [/user/ambari-qa/in-dir]
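
The row count passed to teragen follows from the target data size, since each row is 100 bytes. The arithmetic can be sketched as below; rows_for_bytes and bytes_per_task are hypothetical helper names, and sizes are decimal bytes (1 TB = 10^12 bytes), matching the 10000000000 rows used above.

```shell
#!/bin/sh
# Hypothetical helpers for sizing a Teragen run. Sizes are decimal bytes.
ROW_BYTES=100

# rows_for_bytes BYTES -> number of 100-byte rows to request from teragen
rows_for_bytes() {
    echo $(( $1 / ROW_BYTES ))
}

# bytes_per_task BYTES TASKS -> approximate data written by each map task
# (integer division, so the result is rounded down)
bytes_per_task() {
    echo $(( $1 / $2 ))
}

rows_for_bytes 1000000000000        # prints 10000000000
bytes_per_task 1000000000000 144    # prints 6944444444
```

With 144 maps, each task writes roughly 6.9 GB, which spans about 13 HDFS blocks at the 512 MiB block size set by -Ddfs.blocksize=536870912.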

Terasort

/usr/bin/yarn jar /usr/iop/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort -Dmapreduce.job.reduces=144 -Ddfs.blocksize=536870912 -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec -Dmapreduce.map.output.compress=true [/user/ambari-qa/in-dir] [/user/ambari-qa/out-dir]
