Monitoring a Hadoop & Spark Cluster


Earlier, we presented a list of tools that can be used to monitor a Hadoop or Spark cluster. Click HERE to access “Monitoring Tools for a Hadoop Cluster”.

In other posts, we provided information on monitoring cluster hardware: Click HERE to access “Monitoring RAID drive health”, Click HERE to access “Monitoring drive health using S.M.A.R.T” (JBOD drives), and Click HERE to access “Using IPMI to monitor / manage hardware and provide remote access”.

In addition to monitoring hardware or software health, some monitoring tools allow rules to be defined that proactively trigger alerts when cluster resources reach predefined limits for CPU, memory, storage and network usage, or when a necessary service does not respond within a defined amount of time.

The following section lists recommended areas to monitor within a cluster, along with possible alert ranges. Note that monitoring itself adds overhead to the cluster. Limits should be set to avoid unnecessary overhead on cluster resources while still allowing time for preventative measures. The amount of time necessary is unique to each DevOps team.

Hardware

Server

  • Server available
  • IPMI hardware errors reported

Storage

  • Drives available
  • Storage controller available
  • IPMI hardware errors reported

Network

  • Network available
  • IPMI hardware errors reported
  • SNMP errors reported

Software

OS

  • Ping available, alert when not responding within 30 seconds
  • SSH logon available, alert when not responding within 30 seconds
  • CPU usage, alert when over 90% for a three-minute period
  • Memory usage, alert when over 80% for a three-minute period
  • Paging usage, alert when over 10%
  • Network bandwidth usage, alert when over 90% for a three-minute period
  • I/O subsystem usage, alert when over 90% for a three-minute period
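The "over a limit for a three-minute period" rules above can be sketched as a small stateful check. This is only an illustration, not the rule syntax of any particular monitoring tool: the class name `SustainedThresholdRule` and its parameters are hypothetical, and it assumes the caller feeds in periodic samples with timestamps in seconds.

```python
from typing import Optional


class SustainedThresholdRule:
    """Fires once a metric has stayed above a limit for a full window.

    Hypothetical helper for illustration; real monitoring tools
    usually express this kind of rule in their own configuration.
    """

    def __init__(self, limit_pct: float, window_secs: float) -> None:
        self.limit_pct = limit_pct
        self.window_secs = window_secs
        # Timestamp of the first sample in the current run of breaches.
        self._breach_started: Optional[float] = None

    def observe(self, timestamp: float, value_pct: float) -> bool:
        """Record one sample; return True if an alert should fire."""
        if value_pct <= self.limit_pct:
            # Back under the limit: the run of breaches is over.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.window_secs


# Example: CPU over 90% for three minutes (180 seconds).
cpu_rule = SustainedThresholdRule(limit_pct=90.0, window_secs=180.0)
```

Requiring the breach to persist for the whole window avoids alerting on short spikes, which is one way to keep alert noise and overhead low.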

File System

  • File System available
  • File System storage limit, alert when usage reaches 70% of the limit
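As a rough sketch of the storage check above, the following uses Python's standard `shutil.disk_usage` to compare a local mount point against the 70% mark. The function names and the choice to treat total capacity as the "limit" are assumptions made for illustration; a quota-based limit or HDFS capacity would need a different data source.

```python
import shutil


def percent_used(used_bytes: int, total_bytes: int) -> float:
    """Return used space as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def storage_alert(path: str, limit_pct: float = 70.0) -> bool:
    """Return True when the file system holding `path` is at or above the limit.

    Assumption: the limit is the total capacity of the local file system.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total) >= limit_pct
```

For example, `storage_alert("/data")` would report whether the file system mounted at a hypothetical `/data` path has reached 70% usage.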

Hadoop Services

  • Hadoop services available, alert when service not responding within 30 seconds
    • NameNode, Secondary NameNode, JobTracker, TaskTracker
  • HBase services available, alert when service not responding within 30 seconds
  • Spark services available, alert when service not responding within 30 seconds
  • Custom Job Duration checks for jobs running outside of expected ranges
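A custom job duration check could look something like the sketch below. The expected range per job is assumed to come from the team's own knowledge of past runs; the names `ExpectedDuration` and `check_job_duration` are hypothetical and not part of Hadoop or Spark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedDuration:
    """Expected run time range for one job, in seconds (assumed values)."""
    min_secs: float
    max_secs: float


def check_job_duration(elapsed_secs: float, finished: bool,
                       expected: ExpectedDuration) -> Optional[str]:
    """Return a short alert reason, or None when the job is within range."""
    if elapsed_secs > expected.max_secs:
        # Applies to running and finished jobs alike.
        return "over expected maximum"
    if finished and elapsed_secs < expected.min_secs:
        # A job that ends too quickly may have failed or processed no data.
        return "under expected minimum"
    return None


# Example: a nightly job expected to take between 10 and 30 minutes.
nightly = ExpectedDuration(min_secs=600.0, max_secs=1800.0)
```

Checking the lower bound only for finished jobs avoids alerting on every job that has simply just started.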

Additional Services

  • BigInsights Console available, alert when not responding within 30 seconds
  • Big SQL available, alert when not responding within 30 seconds
  • BigSheets available, alert when not responding within 30 seconds
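The 30-second availability checks listed for the Hadoop, HBase, Spark and additional services can be approximated with a plain TCP connect probe, as in the sketch below. It only confirms that a port accepts connections, not that the service behind it is healthy. The function names are hypothetical, and the host names and port numbers in the example are placeholders that depend on the distribution and configuration.

```python
import socket
from typing import Dict, List, Tuple


def is_responding(host: str, port: int, timeout_secs: float = 30.0) -> bool:
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_secs):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def unavailable(endpoints: Dict[str, Tuple[str, int]],
                timeout_secs: float = 30.0) -> List[str]:
    """Return the names of services that did not respond in time."""
    return [name for name, (host, port) in endpoints.items()
            if not is_responding(host, port, timeout_secs)]


# Placeholder endpoints; replace with the cluster's actual hosts and ports.
EXAMPLE_ENDPOINTS = {
    "namenode": ("namenode.example.com", 8020),
    "hbase-master": ("hbase.example.com", 16000),
}
```

Calling `unavailable(EXAMPLE_ENDPOINTS)` would return the names to alert on. A deeper check would query each service's own status interface.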
