Hadoop or Spark? Which framework should I use?


Hadoop or Spark?

Apache Hadoop and Apache Spark are both big-data frameworks, but they don't serve the same purpose. Hadoop is a distributed data infrastructure: it distributes massive data collections across multiple nodes in a cluster of commodity servers, and it manages that data, enabling effective big-data processing and analytics. Spark is a data-processing tool that operates on those distributed data collections; it does not provide distributed storage.

Can Hadoop and Spark be used stand-alone?

While Hadoop and Spark are best used together, each can stand alone. Hadoop includes not just a storage component, known as the Hadoop Distributed File System (HDFS), but also a data-processing component called MapReduce, so Hadoop can be used without Spark. Spark, in turn, does not come with its own file-management system, but it can run on top of another cluster file system (GPFS, ZFS, GFS2, Gluster, etc.), so Spark can be used without Hadoop.
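
As a minimal sketch of the second case, here is PySpark reading from a plain local file system instead of HDFS. This assumes a local Spark installation; the file path is hypothetical.

    from pyspark.sql import SparkSession

    # Start a local Spark session; no Hadoop cluster or HDFS is involved.
    spark = (SparkSession.builder
             .appName("spark-without-hadoop")
             .master("local[*]")
             .getOrCreate())

    # A "file://" URI points Spark at the local file system; any POSIX-style
    # cluster file system (GPFS, Gluster, etc.) mounted at such a path works
    # the same way. The path itself is hypothetical.
    lines = spark.read.text("file:///data/events/sample.txt")
    print(lines.count())

    spark.stop()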

Spark can be faster

Spark can be faster than MapReduce when it is able to hold a data set in memory. MapReduce operates in discrete steps: read data from the cluster, perform an operation, write the results to the cluster, read the updated data from the cluster, perform the next operation, write the next results to the cluster, and so on. Spark instead attempts to complete the operations on data held in memory: it reads data from the cluster, performs an operation, writes the results to memory (spilling to disk, as MapReduce does, when a result won't fit), performs the next operation on those in-memory results, and writes the final results back to the cluster, to either memory or disk. As long as the results of operations can be held in memory, Spark is faster than MapReduce because it avoids that intermediate I/O.
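
A minimal PySpark sketch of this difference: caching keeps a data set in memory after the first read, so later operations reuse it instead of going back to storage. The file path and column name are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cache-example")
             .master("local[*]")
             .getOrCreate())

    # Read once from storage (the path and schema are hypothetical).
    df = spark.read.csv("file:///data/sensor_readings.csv",
                        header=True, inferSchema=True)

    # cache() asks Spark to keep the data set in memory after the first action;
    # if it doesn't fit, Spark spills partitions to disk, much as MapReduce would.
    df.cache()

    # The first action reads from disk and populates the cache...
    print(df.count())
    # ...subsequent operations reuse the in-memory copy instead of re-reading.
    print(df.filter(df["value"] > 100).count())

    spark.stop()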

Speed may not be required

MapReduce's processing framework can be the best choice when data operations and reporting requirements are static, or when result sets can't be held in memory. Clusters can be sized to make batch-mode processing times acceptable.
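
For illustration, here is a classic batch-style MapReduce job sketched as a Hadoop Streaming mapper and reducer in Python. These are two separate scripts, and the names are hypothetical; Hadoop sorts the mapper output by key and writes each stage's results back to the cluster between steps.

    # mapper.py -- emits (word, 1) for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sums the counts for each word; Hadoop delivers the
    # mapper output sorted by key, so identical words arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

These scripts would typically be submitted with the Hadoop Streaming jar; the exact invocation depends on your installation.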

Spark's processing framework is best when results can be held in memory, or when real-time data, such as sensor data, needs to be handled.
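
As a hedged sketch of the real-time case, Spark Structured Streaming can process sensor readings as they arrive. The socket source, host, port, and "ALERT" convention below are assumptions for illustration; a production pipeline would more likely read from a source such as Kafka.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("sensor-stream")
             .master("local[*]")
             .getOrCreate())

    # Read an unbounded stream of text lines from a socket
    # (host and port are hypothetical).
    readings = (spark.readStream.format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load())

    # Flag readings containing "ALERT" as they arrive, with no batch delay.
    alerts = readings.filter(col("value").contains("ALERT"))

    # Print matching events to the console as a continuously running query.
    query = alerts.writeStream.format("console").start()
    query.awaitTermination()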

Failure recovery

Hadoop is resilient to system faults and failures because data is written to disk after every operation. Spark has similar built-in resiliency: its data objects are resilient distributed datasets (RDDs), which are distributed across the cluster. RDDs can be stored in memory or on disk, and they provide full recovery from faults and failures.
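
A minimal sketch of that recovery mechanism: each RDD records the lineage of transformations that produced it, and Spark replays that lineage to rebuild any partitions lost to a failure. The input path is hypothetical; toDebugString() returns the lineage as bytes in recent PySpark versions.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "lineage-example")

    # Build an RDD through a chain of transformations (path is hypothetical).
    raw = sc.textFile("file:///data/events.txt")
    parsed = raw.map(lambda line: line.split(","))
    errors = parsed.filter(lambda fields: fields[0] == "ERROR")

    # persist() keeps partitions in memory, spilling to disk as needed;
    # if a node fails, Spark recomputes the lost partitions from lineage.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    # toDebugString() shows the lineage graph Spark would replay on failure.
    print(errors.toDebugString().decode("utf-8"))

    sc.stop()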
