1. Job Execution Fault ToleranceFlink's error recovery mechanism is divided into multiple levels, namely the Execution-level Failover strategy and the ExecutionGraph-level Job Restart strategy. When an error occurs, Flink will first try to trigger a small-scale error recovery mechanism. If it still cannot be handled, it will upgrade to a larger-scale error recovery mechanism. For details, see the sequence diagram below. When a task error occurs, TaskManager notifies JobManager through RPC, which changes the status of the corresponding Execution to failed and triggers the Failover strategy. If the failover policy is met, the JobManager will restart the Execution, otherwise it will be upgraded to a failure of the ExecutionGraph. If an ExecutionGraph fails, it enters the failing state, and the Restart strategy determines whether it restarts (restarting state) or exits abnormally (failed state). 1.1 Task Failover StrategyThere are currently three Task Failover strategies: RestartAll, RestartIndividualStrategy, and RestartPipelinedRegionStrategy. RestartAll: Restarting all tasks is the safest strategy for restoring job consistency and is used as a fallback strategy when other failover strategies fail. Currently the default Task Failover strategy. RestartPipelinedRegionStrategy: Restart all tasks in the region where the error task is located. The Task Region is determined by the data transmission of the Task. Tasks with data transmission will be placed in the same Region, and there is no data exchange between different Regions. RestartIndividualStrategy: Resume a single Task. Because if the Task does not contain a data source, it will not be able to re-stream data and may result in some data loss. Considering the provision of at least once delivery semantics, the scope of use of this strategy is relatively limited and is only applied to jobs where there is no data transfer between tasks. 1.2 Job Restart StrategyIf a Task error eventually triggers a Full Restart, the Job Restart strategy will control whether the job needs to be resumed. Flink provides three job-specific Restart Strategies. FixedDelayRestartStrategy: Allows execution failure within a specified number of times. If the number is exceeded, the job will fail. FixedDelayRestartStrategy restart can set a certain delay to reduce the load on external systems and unnecessary error logs caused by frequent retries. FailureRateRestartStrategy: Allows execution to fail within a specified number of times within a specified time window. If this frequency is exceeded, the job will fail. Similarly, FailureRateRestartStrategy can also set a certain restart delay. NoRestartStrategy: Fail the Job directly when Execution fails. 2. Daemon Fault ToleranceIn the deployment mode of Flink on YARN, the key daemons are JobManager and TaskManager. The main responsibilities of JobManager are to coordinate resources and manage the execution of jobs, which are undertaken by ResourceManager and JobMaster daemon threads respectively. The relationship between the three is shown in the following figure. 2.1 TaskManager Fault ToleranceIf the ResourceManager detects a TaskManager failure through a heartbeat timeout or is notified by the cluster manager, it notifies the corresponding JobMaster and starts a new TaskManager to replace it. Note that the ResourceManager does not care about the Flink job, it is the JobMaster's responsibility to manage how the Flink job reacts. If the JobMaster learns about the TaskManager failure through notification from the ResourceManager or detects a heartbeat timeout, it first removes the TaskManager from its slot pool and marks all Tasks running on the TaskManager as failed, thereby triggering the fault tolerance mechanism of Flink job execution to recover the job. The TaskManager status has been written to the checkpoint and will be automatically restored after restart, so there will be no data inconsistency issues. 2.2. Fault Tolerance of ResourceManagerIf TaskManager detects ResourceManager failure through heartbeat timeout, or receives notification from Zookeeper that ResourceManager has lost leadership, TaskManager will look for a new leader, and ResourceManager will restart and register itself with it without interrupting the execution of Tasks. If JobMaster detects ResourceManager failure through heartbeat timeout, or receives notification from Zookeeper that ResourceManager has lost leadership, JobMaster will also wait for the new ResourceManager to become the leader and then re-request all TaskManagers. Considering that the TaskManager may also recover successfully, the TaskManager newly requested by the JobMaster will be released after being idle for a period of time. ResourceManager maintains a lot of status information, including active containers, available TaskManagers, the mapping relationship between TaskManagers and JobMasters, etc. However, this information is not ground truth and can be retrieved from the status synchronization with JobMaster and TaskManager, so this information does not need to be persisted. 2.3 JobMaster Fault ToleranceIf TaskManager detects JobMaster failure through heartbeat timeout, or receives notification from Zookeeper that JobMaster has lost leadership, TaskManager triggers its own error recovery and then waits for a new JobMaster. If the new JobMaster does not appear after a certain period of time, the TaskManager will mark its slot as free and inform the ResourceManager. If ResourceManager detects a JobMaster failure through a heartbeat timeout, or receives a notification from Zookeeper that JobMaster has lost leadership, ResourceManager will inform TaskManager of the failure and will not process any other actions. JobMaster saves a lot of states that are critical to job execution. Among them, JobGraph and user code will be retrieved from persistent storage such as HDFS, checkpoint information will be obtained from Zookeeper, Task execution information may not be restored because the entire job will be rescheduled, and the held slots will be restored from the synchronization information of ResourceManager's TaskManager. 2.4 Concurrent FailuresIn Flink on YARN deployment mode, because both JobMaster and ResourceManager are in the JobManager process, if there is a problem with the JobManager process, it is usually a concurrent failure of JobMaster and ResourceManager. Then TaskManager will handle it as follows:
It is worth noting that the new JobManager is automatically started by relying on YARN's Application attempt retry mechanism. According to the YARN Application: keep-containers-across-application-attempts behavior configured in Flink, the TaskManager will not be cleaned up, so it can be re-registered with the newly started Flink ResourceManager and JobMaster. ConclusionFlink's fault tolerance mechanism ensures the reliability and durability of Flink. Specifically, it includes two aspects: job execution fault tolerance and daemon fault tolerance. In terms of job execution fault tolerance, Flink provides Task-level Failover strategies and Job-level Restart strategies for automatic retries in case of failures. In terms of daemon fault tolerance, in on YARN mode, Flink performs failure detection through the heartbeat of internal components and YARN monitoring. The failure of a TaskManager can be recovered by applying for a new TaskManager and restarting the Task or Job. The failure of a JobManager can be recovered by the cluster manager automatically pulling up a new JobManager and re-registering the TaskManager with the new leader JobManager. The above is a brief discussion of the details of Flink's fault-tolerant mechanism, job execution and daemon. For more information about Flink's fault-tolerant mechanism, job execution and daemon, please pay attention to other related articles on 123WORDPRESS.COM! You may also be interested in:
|
<<: A brief discussion on the lock range of MySQL next-key lock
>>: Several methods of calling js in a are sorted out and recommended for use
Sometimes we need to import some data from anothe...
After installing VMware Tools, ① text can be copi...
1. Data backup 1. Use mysqldump command to back u...
First of all, the formation of web page style main...
Table of contents 1. Usage of DATETIME and TIMEST...
mysql storage engine: The MySQL server adopts a m...
Preface Project release always requires packaging...
Phenomenon There are several nested flex structur...
Do you add an alt attribute to the img image tag? ...
The most popular tag is IE8 Browser vendors are sc...
I re-read the source code of the Fabric project a...
1. Create a centos7.6 system and optimize the sys...
If you directly set the width attribute to the sty...
When the token expires, refresh the page. If the ...
Recommended reading: MySQL 8.0.19 supports accoun...