A brief discussion on the construction and operation mechanism of the real-time computing framework Flink cluster

1. Flink Overview

1.1 Basic Introduction

Apache Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. Its main features include: unified batch and stream processing, precise state management, event-time support, and exactly-once state consistency guarantees. Flink can run on a variety of resource management frameworks, including YARN, Mesos, and Kubernetes, and also supports standalone deployment on bare-metal clusters. With the high-availability option enabled, it has no single point of failure.

Two concepts deserve explanation here (see the sketch after this list):

  • Boundedness: data streams can be unbounded (no defined end, processed continuously as data arrives) or bounded (a finite dataset with a known start and end that can be processed as a batch);
  • State: whether there is a dependency in the execution order, that is, whether the next computation depends on previous results;
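
A minimal Java sketch contrasting the two boundary types, reusing the file path and socket address that appear in the examples later in this article; the class name is illustrative:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundaryExample {
    public static void main(String[] args) throws Exception {
        // Bounded: a finite file, processed with the batch API (DataSet).
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> bounded = batchEnv.readTextFile("/var/flink/test/word.txt");

        // Unbounded: a socket stream with no defined end, processed with the streaming API (DataStream).
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> unbounded = streamEnv.socketTextStream("hop01", 5566);
    }
}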

1.2 Application Scenarios

Event Driven

Event-driven applications do not need to query a remote database; local data access gives them higher throughput and lower latency. Taking anti-fraud as an example, the processing rule model is written with the DataStream API and the whole logic is submitted to the Flink engine. When events or data flow in, the corresponding rule model is triggered; once a rule's conditions are met, the application processes the event quickly and notifies the business system.
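
As a hedged sketch of that idea, the rule model below is a trivial stand-in (a single-transaction amount threshold); the event format, host, port, and threshold are all hypothetical:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FraudRuleExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: one "userId,amount" record per line.
        DataStream<String> events = env.socketTextStream("hop01", 5566);

        // Stand-in rule model: flag any single transaction over 10000.
        events.filter(line -> Double.parseDouble(line.split(",")[1]) > 10000)
              .map(line -> "ALERT: suspicious transaction -> " + line)
              .print(); // A real application would notify the business system here.

        env.execute("fraud-rule-demo");
    }
}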

Data Analytics

Compared with batch analysis, streaming analysis eliminates the need for periodic data imports and scheduled query runs, so the latency from event to metric is lower. In addition, batch queries must deal with the artificial boundaries that periodic imports impose on the input, while continuous streaming queries do not. Flink provides good support for both continuous streaming analysis and batch analysis, processing and analyzing data in real time; it is widely used in scenarios such as real-time dashboards and real-time reports.
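
A minimal hedged sketch of such a continuously updated metric: counting page views per page over ten-second tumbling windows. The input format, host, and port are hypothetical:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PageViewAnalytics {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: one page name per line.
        DataStream<String> pages = env.socketTextStream("hop01", 5566);

        pages.map(new MapFunction<String, Tuple2<String, Integer>>() {
                 @Override
                 public Tuple2<String, Integer> map(String page) {
                     return new Tuple2<>(page, 1);
                 }
             })
             .keyBy(0)                          // group by page name
             .timeWindow(Time.seconds(10))      // ten-second tumbling windows
             .sum(1)                            // view count per page per window
             .print();

        env.execute("page-view-analytics");
    }
}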

Data Pipeline

Compared with periodic ETL jobs, a continuous data pipeline significantly reduces the latency of moving data to its destination. For example, streaming ETL can clean or enrich data in real time upstream, while a real-time data warehouse is built downstream, ensuring timely data queries and forming an efficient query link. This scenario is very common in media stream recommendation and search engines.

2. Environment Deployment

2.1 Installation Package Management

[root@hop01 opt]# tar -zxvf flink-1.7.0-bin-hadoop27-scala_2.11.tgz

[root@hop01 opt]# mv flink-1.7.0 flink1.7

2.2 Cluster Configuration

Management Node

[root@hop01 opt]# cd /opt/flink1.7/conf

[root@hop01 conf]# vim flink-conf.yaml

jobmanager.rpc.address: hop01
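
A few other entries in flink-conf.yaml are commonly adjusted alongside the RPC address. The values below are illustrative, not recommendations; verify the keys against the flink-conf.yaml shipped with your version:

# Illustrative settings -- tune per machine.
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2
rest.port: 8081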

Worker Nodes

[root@hop01 conf]# vim slaves

hop02

hop03

Synchronize both configuration files to all cluster nodes.
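
One way to do this, assuming passwordless SSH between the nodes, is scp:

[root@hop01 conf]# scp -r /opt/flink1.7/conf root@hop02:/opt/flink1.7/
[root@hop01 conf]# scp -r /opt/flink1.7/conf root@hop03:/opt/flink1.7/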

2.3 Start and Stop

/opt/flink1.7/bin/start-cluster.sh

/opt/flink1.7/bin/stop-cluster.sh

Startup log:

[root@hop01 conf]# /opt/flink1.7/bin/start-cluster.sh

Starting cluster.

Starting standalonesession daemon on host hop01.

Starting taskexecutor daemon on host hop02.

Starting taskexecutor daemon on host hop03.

2.4 Web Interface

Visit: http://hop01:8081/
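
The same information is also exposed through a REST API on the same port; for example, the cluster overview endpoint returns JSON with TaskManager and slot counts:

[root@hop01 ~]# curl http://hop01:8081/overview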

3. Getting Started Example

3.1 Data Script

Distribute a test data file to each node:

/var/flink/test/word.txt
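
The programs below split each line on commas, so the file can contain a few comma-separated words; a hypothetical sample:

c++,java
java,flink
flink,scala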

3.2 Introducing Basic Dependencies

This basic example is written in Java and requires the following Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
</dependencies>

3.3 Read File Data

Here, the program reads the file directly and counts how many times each word appears.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read file data
        readFile();
    }

    public static void readFile() throws Exception {
        // 1. Create the execution environment
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the data file
        String filePath = "/var/flink/test/word.txt";
        DataSet<String> inputFile = environment.readTextFile(filePath);

        // 3. Group by word and sum the counts
        DataSet<Tuple2<String, Integer>> wordDataSet =
                inputFile.flatMap(new WordFlatMapFunction()).groupBy(0).sum(1);

        // 4. Print the processing results (print() triggers execution in the batch API)
        wordDataSet.print();
    }

    // Splits each line into words and emits (word, 1) pairs
    static class WordFlatMapFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) {
            String[] wordArr = input.split(",");
            for (String word : wordArr) {
                collector.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
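
With the hypothetical sample word.txt shown above, print() would emit one (word, count) tuple per distinct word, along the lines of the following (ordering may vary):

(c++,1)
(java,2)
(flink,2)
(scala,1)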

3.4 Read Port Data

Open a listening port on hop01 with netcat and send it some test data:

[root@hop01 ~]# nc -lk 5566

c++,java

Use a Flink program to read and analyze the data arriving on the port:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read port data
        readPort();
    }

    public static void readPort() throws Exception {
        // 1. Create the streaming execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read from the socket data port
        DataStreamSource<String> inputStream = environment.socketTextStream("hop01", 5566);

        // 3. Split each line into words, emit (word, 1) pairs, then key by word and sum
        SingleOutputStreamOperator<Tuple2<String, Integer>> resultDataStream = inputStream.flatMap(
                new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) {
                String[] wordArr = input.split(",");
                for (String word : wordArr) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        }).keyBy(0).sum(1);

        // 4. Print the analysis results
        resultDataStream.print();

        // 5. Start the streaming job
        environment.execute();
    }
}
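
To run either example on the cluster rather than from the IDE, package it as a jar and submit it with the flink CLI; the main class and jar path here are hypothetical:

[root@hop01 ~]# /opt/flink1.7/bin/flink run -c com.example.WordCount /opt/jobs/word-count.jar

Note that for a streaming job running on the cluster, print() output goes to the TaskManager logs rather than the client console.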

4. Operation Mechanism

4.1 Flink Client

The client prepares the dataflow and sends it to the JobManager node. After submitting, the client can either disconnect immediately or stay connected and wait for the task's processing results.

4.2 JobManager

A Flink cluster starts one JobManager node and at least one TaskManager node. After the JobManager receives a task submitted by the client, it coordinates the task and dispatches it to specific TaskManager nodes for execution. The TaskManager nodes send heartbeats and processing statistics back to the JobManager.

4.3 TaskManager

A slot is the smallest unit of resource scheduling in a TaskManager. The number of slots is fixed when the TaskManager starts. Each slot can run one task, receiving tasks deployed by the JobManager node and performing the actual analysis and processing.
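
The slot count is set in flink-conf.yaml; a common starting point is one slot per CPU core (the value below is illustrative):

taskmanager.numberOfTaskSlots: 2

With the two worker nodes configured above, this would give the cluster four slots in total.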

5. Source code address

GitHub Address

https://github.com/cicadasmile/big-data-parent

GitEE Address

https://gitee.com/cicadasmile/big-data-parent

This concludes the brief discussion of the construction and operation mechanism of the real-time computing framework Flink cluster.
