Introduction to Spark and comparison with Hadoop

1. Spark vs. Hadoop

1.1 Disadvantages of Hadoop

  • 1. Limited expressiveness: every computation has to be written as Map and Reduce operations;
  • 2. High disk I/O overhead;
  • 3. High latency;
  • 4. Handing data from one task to the next involves I/O overhead;
  • 5. A task cannot start until the previous one has finished, which makes complex, multi-stage computations hard to handle.

1.2 Advantages of Spark over Hadoop MR

  • 1. Spark's computing model is also MapReduce-style, but it is not limited to Map and Reduce operations. It provides many other data set operations, so its programming model is more flexible than Hadoop MR's;
  • 2. Spark provides in-memory computing: intermediate results can be kept in memory, which makes iterative computations more efficient;
  • 3. Spark's DAG-based task scheduling and execution mechanism is superior to Hadoop MR's iterative execution mechanism.

Aspect | Spark | MapReduce
--- | --- | ---
Data storage structure | Resilient distributed datasets (RDDs) built in memory, used to operate on and cache data | Splits of files in the disk-based HDFS file system
Programming paradigm | DAG (Transformation + Action) | Map + Reduce
Storage of intermediate results | Kept in memory; access is several orders of magnitude faster than disk | Written to disk; high I/O and serialization/deserialization costs
Task execution unit | Threads | Processes
Time | Sub-second latency for reading small data sets | Several seconds just to start a task
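
To make the Transformation + Action paradigm and the benefit of keeping intermediate results in memory more concrete, here is a minimal pure-Python sketch. `MiniRDD`, `load`, and the `reads` counter are hypothetical names invented for this illustration; this is not Spark's API, only an imitation of the idea that transformations are recorded lazily and that a cached result is not recomputed.

```python
import functools


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable that yields the records
        self._keep = False        # set by cache(): keep results in memory
        self._data = None         # cached records, once computed

    def _records(self):
        if self._data is not None:
            return self._data               # served from memory, no recomputation
        data = list(self._compute())        # walks back through the parent chain
        if self._keep:
            self._data = data
        return data

    # Transformations: only describe the work and return a new MiniRDD.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._records()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._records() if pred(x)))

    def cache(self):
        self._keep = True
        return self

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._records())

    def reduce(self, fn):
        return functools.reduce(fn, self._records())


reads = []  # one entry per scan of the (pretend) on-disk source


def load():
    reads.append("scan")
    return range(1, 6)


squares = MiniRDD(load).map(lambda x: x * x).cache()
scans_before_action = len(reads)                         # transformations alone read nothing
total = squares.reduce(lambda a, b: a + b)               # first action computes and caches
evens = squares.filter(lambda x: x % 2 == 0).collect()   # second action reuses the cache
scans_after_actions = len(reads)
```

In this sketch the source is scanned once even though two actions run, which is the effect the table attributes to keeping intermediate results in memory. Without the `cache()` call, each action would scan the source again.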

2. Spark Ecosystem

2.1 Three Types of Big Data Processing

1. Complex batch data processing

Time span: tens of minutes to several hours.

Typical tool: Hadoop MapReduce.

2. Interactive queries over historical data

Time span: tens of seconds to several minutes.

Typical tool: Cloudera Impala, whose real-time performance is better than Hive's.

3. Processing of real-time data streams

Time span: hundreds of milliseconds to several seconds.

Typical tool: Storm.

2.2 BDAS Architecture

2.3 Spark Ecosystem

3. Basic concepts and architecture design

3.1 Basic Concepts

3.2 Operational Architecture

Advantages of Spark's Executor compared with Hadoop MR:

  • 1. Tasks are executed by threads, which reduces task startup overhead;
  • 2. The Executor contains a BlockManager storage module that uses both memory and disk as storage devices, which effectively reduces I/O overhead.
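
As a rough illustration of the two points above, the following sketch runs several tasks as threads inside one long-lived process, all reading from a shared in-memory store. The names `block_store` and `run_task` are hypothetical, and Python's `ThreadPoolExecutor` is only a stand-in for a Spark Executor; the sketch shows the structure, not real performance behavior.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory block store shared by all tasks in this process,
# loosely playing the role the article assigns to BlockManager.
block_store = {
    "part-0": [1, 2, 3],
    "part-1": [4, 5, 6],
    "part-2": [7, 8, 9],
}


def run_task(partition_id):
    """One task: process a single partition read from the shared store."""
    return sum(block_store[partition_id])


# Tasks are handed to threads of an already running pool, so no new
# process has to be started for each task.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(run_task, sorted(block_store)))
```

`executor.map` returns results in the order of its inputs, so `results` lines up with `part-0`, `part-1`, `part-2`.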

3.3 Relationships between various concepts

  • An Application consists of a Driver and several Jobs.
  • A Job consists of multiple Stages.
  • A Stage consists of multiple Tasks that have no shuffle dependencies between them.

When an Application is executed, the Driver requests resources from the cluster manager, starts the Executors, and sends the application code and files to them. The Tasks then run on the Executors. When the run completes, the results are returned to the Driver or written to HDFS or another storage system.

4. Basic Spark execution process

4.1 Operation Process

1. Build the basic runtime environment for the application: the Driver creates a SparkContext, which requests resources, assigns tasks, and monitors them.

2. The resource manager allocates resources for the Executors and starts the Executor processes.

  • 3.1 SparkContext builds a DAG from the RDD dependencies and submits it to the DAGScheduler, which splits it into Stages; each Stage's TaskSet is then submitted to the underlying TaskScheduler.
  • 3.2 Executors request Tasks from SparkContext; the TaskScheduler sends Tasks to the Executors to run and provides the application code.

4. Tasks run on the Executors and report their results to the TaskScheduler, which passes them on to the DAGScheduler. When execution finishes, the output is written and all resources are released.
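
The splitting of a DAG into Stages and TaskSets can be sketched for the simplest case, a linear chain of operations. This is a simplified illustration under stated assumptions, not the real DAGScheduler: the functions `split_into_stages` and `build_task_sets` are hypothetical, the set `WIDE` lists only a few shuffle-producing operations, and each shuffle operation is simply treated as the start of a new stage.

```python
# Operations assumed to require a shuffle (a small illustrative subset).
WIDE = {"reduceByKey", "groupByKey", "join"}


def split_into_stages(ops):
    """Cut a linear chain of operation names into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)   # the shuffle closes the current stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def build_task_sets(stages, num_partitions):
    """One task per (stage, partition); each inner list is one stage's TaskSet."""
    return [
        [(stage_id, partition) for partition in range(num_partitions)]
        for stage_id in range(len(stages))
    ]


pipeline = ["flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(pipeline)
task_sets = build_task_sets(stages, num_partitions=3)
```

Here the chain is cut into two stages, `["flatMap", "map"]` and `["reduceByKey", "filter"]`, and with three partitions each stage yields a TaskSet of three tasks. Operations within a stage need no shuffle between them, so each task can run them back to back on its own partition.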

4.2 Operational Architecture Features

1. Each Application has its own Executor processes, which stay resident for as long as the Application is running. The Executor processes run Tasks in multiple threads.

2. Spark's execution process is independent of the resource manager; all it needs is to obtain Executor processes and keep communicating with them.

3. Tasks are scheduled with optimizations such as data locality and speculative execution. (Computation moves closer to the data.)
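
The data locality idea in point 3 can be sketched as a small placement rule: prefer a free host that already holds the task's data, and fall back to any free host otherwise. The function `place_task` and the host and partition names are hypothetical; real Spark scheduling has more locality levels and also waits a while before giving up on a local slot, and speculative execution is not modeled here at all.

```python
def place_task(preferred_hosts, free_hosts):
    """Return (host, locality) for one task.

    preferred_hosts: hosts that hold the task's input data, best first.
    free_hosts: hosts that currently have a free slot (must be non-empty).
    """
    for host in preferred_hosts:
        if host in free_hosts:
            return host, "NODE_LOCAL"    # computation moves to the data
    return min(free_hosts), "ANY"        # no local slot: data must be fetched


# Hypothetical cluster state: where each partition lives, and who is free.
block_locations = {
    "part-0": ["node-a", "node-b"],
    "part-1": ["node-c"],
}
free_hosts = {"node-b", "node-d"}

placements = {
    partition: place_task(hosts, free_hosts)
    for partition, hosts in block_locations.items()
}
```

In this example `part-0` is placed on `node-b`, which holds a copy of its data, while `part-1` has no free local host and falls back to a remote one.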

5. Spark deployment and application methods

5.1 Three deployment methods of Spark

5.1.1 Standalone

Similar to MapReduce 1.0, Standalone mode uses the slot as its unit of resource allocation, but its performance is not good.

5.1.2 Spark on Mesos

Mesos and Spark have a certain affinity: both originated at UC Berkeley.

5.1.3 Spark on YARN

The relationship between Mesos and YARN

5.2 From Hadoop+Storm Architecture to Spark Architecture

Hadoop+Storm architecture:

Deploying the two frameworks side by side is relatively complex.

Spark architecture covering both batch and stream processing:

Spark simulates stream computing with fast micro-batches, which is not true stream computing, so it cannot achieve millisecond-level latency. Enterprise applications that need millisecond-level real-time responses still require a stream computing framework such as Storm.
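
Why micro-batching cannot reach millisecond latency can be seen in a small sketch. The function `micro_batches`, the sample `events`, and the 500 ms interval are all assumptions made for illustration; this is not Spark Streaming's API, only the batching idea.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        # Every event in the same interval lands in the same batch.
        batches[timestamp_ms // batch_interval_ms].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical event stream: (arrival time in ms, payload).
events = [(10, "a"), (120, "b"), (480, "a"), (510, "c"), (990, "a"), (1020, "b")]

batches = micro_batches(events, batch_interval_ms=500)
counts = [len(batch) for batch in batches]   # one "batch job" result per interval
```

The event that arrives at 10 ms is only processed together with the rest of its batch, after the 500 ms interval closes. The batch interval is therefore a lower bound on response time, whereas a record-at-a-time framework such as Storm handles each event as it arrives.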

Advantages of Spark architecture:

  • 1. One-click installation and configuration, with thread-level task monitoring and alerting;
  • 2. Lower difficulty of hardware cluster management, software maintenance, task monitoring, and application development;
  • 3. A unified resource pool for hardware and computing platforms is easy to build.

5.3 Unified Deployment of Hadoop and Spark

Different computing frameworks all run on YARN.

The benefits are as follows:

  • 1. Computing resources can be scaled up or down on demand;
  • 2. Applications with different load profiles share one cluster, so cluster utilization is high;
  • 3. Share underlying storage to avoid data migration across clusters

Current status:

1. Spark cannot yet replace the functions provided by some components of the Hadoop ecosystem.

2. Migrating existing applications built on Hadoop components entirely to Spark carries a certain cost.
