Introduction to Spark and comparison with Hadoop

1. Spark vs. Hadoop

1.1 Disadvantages of Hadoop
1.2 Advantages over Hadoop MR

2. Spark Ecosystem

2.1 Three Types of Big Data Processing

1. Complex batch data processing
2. Interactive query based on historical data
3. Data processing based on real-time data stream

2.2 BDAS Architecture

2.3 Spark Ecosystem

3. Basic concepts and architecture design

3.1 Basic Concepts

3.2 Operational Architecture

3.3 Relationships between various concepts

4. Basic Spark Execution Process

4.1 Operation Process

4.2 Operational Architecture Features

5. Spark deployment and application methods

5.1 Three deployment methods of Spark

5.1.1 Standalone
5.1.2 Spark on Mesos
5.1.3 Spark on YARN

5.2 From Hadoop+Storm Architecture to Spark Architecture

Hadoop+Storm architecture
Using Spark architecture to meet batch and stream processing needs
Advantages of Spark architecture:

5.3 Unified Deployment of Hadoop and Spark

Different computing frameworks run uniformly in YARN

1. Spark vs. Hadoop

1.1 Disadvantages of Hadoop

1. Limited expressiveness: every computation must be cast as Map and Reduce operations;
2. High disk IO overhead;
3. High latency;
4. Handing data from one task to the next involves IO overhead;
5. A task cannot start until the preceding task has finished, which makes complex, multi-stage computations hard to handle.

1.2 Advantages over Hadoop MR

1. Spark's computing model is also MapReduce-style, but it is not limited to Map and Reduce operations. It provides many other dataset operations, so its programming model is more flexible than Hadoop MR's.
2. Spark computes in memory and can keep intermediate results there, which makes iterative computation more efficient;
3. Spark's DAG-based task scheduling and execution mechanism is superior to Hadoop MR's iterative execution mechanism.

Aspect	Spark	MapReduce
Data storage structure	Builds resilient distributed datasets (RDDs) in memory, then operates on and caches the data there	Splits of files stored on disk in HDFS
Programming paradigm	DAG (Transformation + Action)	Map + Reduce
Intermediate results	Kept in memory, where access is several orders of magnitude faster than on disk	Written to disk, incurring high IO, serialization, and deserialization costs
Task execution unit	Threads	Processes
Latency	Sub-second when reading small data sets	Several seconds just to start a task
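
The "Transformation + Action" paradigm in the table can be illustrated with a small pure-Python sketch. This is a toy model, not Spark's API: the class name ToyDataset and its methods are hypothetical and only mimic the idea that transformations record lineage lazily, while an action triggers the computation and streams records through the whole chain in memory.

```python
from functools import reduce as fold


class ToyDataset:
    """Hypothetical stand-in for an RDD: records a chain of lazy transformations."""

    def __init__(self, source, ops=()):
        self._source = source    # any re-iterable collection
        self._ops = tuple(ops)   # recorded lineage: ("map" | "filter", function)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return ToyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        # Chain lazy iterators so each record flows through every step
        # in memory, with no intermediate result written out.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []  # records when the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


squares_of_evens = ToyDataset(range(1, 7)).filter(lambda x: x % 2 == 0).map(square)
ran_before_action = list(calls)                       # still empty: nothing has run
total = squares_of_evens.reduce(lambda a, b: a + b)   # 4 + 16 + 36
```

In this sketch, `square` is not called until `reduce` runs, which is the behavior the DAG paradigm relies on to plan a whole chain of operations before executing it.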

2. Spark Ecosystem

2.1 Three Types of Big Data Processing

1. Complex batch data processing

Time span: tens of minutes to several hours.

Typical tool: Hadoop MapReduce

2. Interactive query based on historical data

Time span: tens of seconds to several minutes.

Typical tool: Cloudera Impala, whose real-time performance is better than Hive's.

3. Data processing based on real-time data stream

Time span: hundreds of milliseconds to several seconds.

Typical tool: Storm

2.2 BDAS Architecture

2.3 Spark Ecosystem

3. Basic concepts and architecture design

3.1 Basic Concepts

3.2 Operational Architecture

Advantages of Spark's Executor over Hadoop MR:

1. Tasks run as threads inside the Executor, which reduces task startup overhead;
2. Each Executor contains a BlockManager storage module that uses both memory and disk as storage, which effectively reduces IO overhead.
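
The two points above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: ToyExecutor, load_block, and the dictionary-based block store are hypothetical, and the store models only the memory tier (the real BlockManager also uses disk). It shows tasks running as threads in one long-lived worker and sharing cached blocks, so a block is loaded once however many tasks read it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyExecutor:
    """Hypothetical model of an Executor: one resident worker that runs
    tasks as threads and shares an in-memory block store among them."""

    def __init__(self, num_threads=4):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)
        self._blocks = {}               # block id -> cached data (memory only)
        self._lock = threading.Lock()
        self.loads = 0                  # how many times the slow source was read

    def _get_block(self, block_id, load):
        # Holding the lock while loading keeps the toy simple and ensures
        # each block is loaded at most once.
        with self._lock:
            if block_id not in self._blocks:
                self._blocks[block_id] = load(block_id)
                self.loads += 1
            return self._blocks[block_id]

    def submit(self, block_id, load, compute):
        # Each task is a thread in the existing pool, not a new process.
        return self._pool.submit(lambda: compute(self._get_block(block_id, load)))

    def shutdown(self):
        self._pool.shutdown(wait=True)


def load_block(block_id):
    # Stand-in for a slow read from external storage.
    return list(range(block_id * 10, block_id * 10 + 10))


executor = ToyExecutor(num_threads=4)
futures = [executor.submit(b, load_block, sum) for b in (0, 1, 0, 1, 0)]
results = [f.result() for f in futures]
executor.shutdown()
```

Five tasks are submitted here, but only two distinct blocks exist, so `executor.loads` ends at 2 while every task still gets its result.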

3.3 Relationships between various concepts

An Application consists of a Driver and several Jobs.
A Job consists of multiple Stages.
A Stage consists of multiple Tasks with no shuffle dependencies between them.

When an Application runs, the Driver requests resources from the cluster manager, starts the Executors, and sends them the application code and files. The Tasks then run on the Executors, and when they finish, the results are returned to the Driver or written to HDFS or another data store.
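
The containment relationships described above can be written down as a small data model. The dataclasses below are purely illustrative (the field names are assumptions, not Spark classes); they only encode that an Application holds a Driver and Jobs, a Job holds Stages, and a Stage holds Tasks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    partition: int            # the data partition this task processes


@dataclass
class Stage:
    name: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    action: str               # label for the action that triggered the job
    stages: List[Stage] = field(default_factory=list)


@dataclass
class Application:
    driver: str
    jobs: List[Job] = field(default_factory=list)

    def task_count(self):
        # Walk Application -> Job -> Stage -> Task.
        return sum(len(s.tasks) for j in self.jobs for s in j.stages)


app = Application(
    driver="driver-0",
    jobs=[
        Job(
            action="collect",
            stages=[
                Stage("stage-0", [Task(p) for p in range(3)]),
                Stage("stage-1", [Task(p) for p in range(2)]),
            ],
        )
    ],
)
total_tasks = app.task_count()
```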

4. Basic Spark Execution Process

4.1 Operation Process

1. Build the basic runtime environment for the Application: the Driver creates a SparkContext, which requests resources, assigns Tasks, and monitors them.

2. The resource manager allocates resources to the Executors and starts the Executor processes.

3a. SparkContext builds a DAG from the dependencies between RDDs and submits it to the DAGScheduler, which splits it into Stages. Each Stage's TaskSet is then submitted to the underlying TaskScheduler.
3b. Executors request Tasks from SparkContext, and the TaskScheduler sends the Tasks, together with the application code, to the Executors to run.

4. Tasks run on the Executors and report their results to the TaskScheduler, which passes them on to the DAGScheduler. When execution finishes, the data is written out and all resources are released.
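
The step in which the DAG is split into Stages and TaskSets can be sketched as follows. This is a deliberately simplified model, not the DAGScheduler's actual algorithm: it assumes a linear lineage (real DAGs can branch), starts a new stage at every operation marked as needing a shuffle, and gives every stage the same number of partitions. The function names are hypothetical; the operation names are used only as labels.

```python
def split_into_stages(lineage):
    """lineage: list of (op_name, needs_shuffle) pairs in execution order.
    A new stage begins at each operation that needs a shuffle."""
    stages, current = [], []
    for name, needs_shuffle in lineage:
        if needs_shuffle and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


def build_task_sets(stages, num_partitions):
    """One task per partition for each stage, identified as (stage_id, partition)."""
    return [
        [(stage_id, p) for p in range(num_partitions)]
        for stage_id, _ in enumerate(stages)
    ]


lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),   # shuffle boundary
    ("filter", False),
    ("sortByKey", True),     # shuffle boundary
]
stages = split_into_stages(lineage)
task_sets = build_task_sets(stages, num_partitions=2)
```

With this lineage the sketch yields three stages, because operations without a shuffle between them can be pipelined inside the same stage.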

4.2 Operational Architecture Features

1. Each Application has its own dedicated Executor processes, which stay resident for the lifetime of the Application and run Tasks as multiple threads.

2. Spark's execution does not depend on which resource manager is used, as long as Spark can acquire Executor processes and keep communicating with them.

3. Task scheduling uses optimizations such as data locality and speculative execution. (Computation moves to the data.)
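
The data locality idea in point 3 can be shown with a toy scheduler. It is a sketch under simple assumptions, not Spark's TaskScheduler: the function assign_tasks, the executor and node names, and the tie-breaking rule (least-loaded executor first) are all hypothetical. A task goes to an executor on a host that already holds its input block when one exists, and falls back to any executor otherwise.

```python
def assign_tasks(block_locations, executors):
    """block_locations: {task_id: set of hosts holding the task's input block}
    executors: {executor_id: host}
    Returns {task_id: (executor_id, "local" or "remote")}."""
    load = {e: 0 for e in executors}
    plan = {}
    for task_id in sorted(block_locations):
        hosts = block_locations[task_id]
        # Prefer executors that sit on a host holding the data.
        local = [e for e, h in executors.items() if h in hosts]
        candidates = local or list(executors)
        # Among the candidates, pick the least-loaded one (ties by id).
        chosen = min(candidates, key=lambda e: (load[e], e))
        load[chosen] += 1
        plan[task_id] = (chosen, "local" if local else "remote")
    return plan


executors = {"exec-1": "node-a", "exec-2": "node-b"}
block_locations = {
    0: {"node-a"},
    1: {"node-b"},
    2: {"node-a", "node-b"},
    3: {"node-c"},           # no executor on this host
}
plan = assign_tasks(block_locations, executors)
```

Only task 3 runs remotely in this example, because no executor lives on the node that stores its block.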

5. Spark deployment and application methods

5.1 Three deployment methods of Spark

5.1.1 Standalone

Similar to MapReduce 1.0, the slot is the unit of resource allocation, but performance is poor.

5.1.2 Spark on Mesos

Mesos and Spark have a certain affinity.

5.1.3 Spark on YARN

The relationship between Mesos and YARN

5.2 From Hadoop+Storm Architecture to Spark Architecture

Hadoop+Storm architecture

Running Hadoop for batch processing alongside Storm for stream processing makes deployment more complicated.

Using Spark architecture to meet batch and stream processing needs

Spark simulates stream computing with fast micro-batch computations. It is not true stream computing, so it cannot achieve millisecond-level latency. Enterprise applications that require millisecond-level real-time responses still need a stream computing framework such as Storm.
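
The micro-batch approach can be illustrated with a short pure-Python sketch. It is a toy model rather than Spark's streaming API: the function names and the 500 ms interval are assumptions chosen for illustration. Events are grouped into fixed time intervals and each group is processed as a small batch, so an event that arrives early in an interval waits until that interval closes.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into fixed-length intervals.
    Returns a list of (batch_start_ms, values) in time order."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[ts // batch_interval_ms].append(value)
    return [(idx * batch_interval_ms, batches[idx]) for idx in sorted(batches)]


def process_stream(events, batch_interval_ms, batch_fn):
    """Apply an ordinary batch function to each micro-batch."""
    return [
        (start, batch_fn(values))
        for start, values in micro_batches(events, batch_interval_ms)
    ]


events = [(5, 1), (120, 2), (480, 3), (510, 4), (990, 5), (1020, 6)]
results = process_stream(events, batch_interval_ms=500, batch_fn=sum)
```

In this sketch the event stamped 5 ms is only processed once the first 500 ms interval ends, which is why the batch interval sets a floor on latency for this style of processing.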

Advantages of Spark architecture:

1. One-click installation and configuration, with thread-level task monitoring and alerting;
2. Lower difficulty in managing hardware clusters, maintaining software, monitoring tasks, and developing applications;
3. A unified pool of hardware and computing platform resources is easy to build.

5.3 Unified Deployment of Hadoop and Spark

Different computing frameworks run uniformly in YARN

The benefits are as follows:

1. Computing resources can be scaled up or down on demand;
2. Applications with different workloads can share one cluster, which keeps cluster utilization high;
3. The underlying storage is shared, which avoids migrating data across clusters.

Current status:

1. Spark cannot yet replace the functions provided by some components of the Hadoop ecosystem.

2. Fully migrating existing applications built on Hadoop components to Spark carries a significant cost.
