Example of using Docker Swarm to build a distributed crawler cluster

During crawler development, you have probably run into the situation where a crawler needs to be deployed on multiple servers. What do you do then? SSH into each server one by one, pull the code with git, and run it? And when the code changes, do you log in to every server again and update them one at a time?

Sometimes the crawler only needs to run on one server; sometimes it needs to run on 200. How do you switch quickly? Log in to each server one by one to turn it on or off? Or get clever and set a modifiable flag in Redis, so that only the crawlers on the servers matching the flag actually run?

Crawler A is already deployed on every server, and now you have written crawler B. Do you have to log in to each server one by one and deploy all over again?

If you did any of these, you will regret not having seen this article sooner. After reading it, you will be able to:

Deploy a new crawler to 50 servers in 2 minutes:

docker build -t localhost:8003/spider:0.01 .
docker push localhost:8003/spider:0.01
docker service create --name spider --replicas 50 --network host 45.77.138.242:8003/spider:0.01

Scale the crawler from 50 to 500 servers in 30 seconds:

docker service scale spider=500

Batch shut down crawlers on all servers within 30 seconds:

docker service scale spider=0

Batch update crawlers on all machines within 1 minute:

docker build -t localhost:8003/spider:0.02 .
docker push localhost:8003/spider:0.02
docker service update --image 45.77.138.242:8003/spider:0.02 spider

This article will not teach you how to use Docker, so please make sure you have some Docker basics before reading this article.

What is Docker Swarm?

Docker Swarm is a cluster management module that comes with Docker. It can create and manage Docker clusters.

Environment Construction

This article will use three Ubuntu 18.04 servers for demonstration. The three servers are arranged as follows:

Master: 45.77.138.242

Slave-1: 199.247.30.74

Slave-2: 95.179.143.21

Docker Swarm is a module built on top of Docker, so Docker must first be installed on all three servers. Once Docker is installed, all remaining operations are completed through Docker.

Install Docker on the Master

Install Docker on the Master server by executing the following commands in sequence:

apt-get update
apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
apt-get update
apt-get install -y docker-ce

Creating a Manager Node

A Docker Swarm cluster requires a Manager node. Initialize the Master server as the cluster's Manager node by running the following command:

docker swarm init

After the command completes, you will see output like the figure below.
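A minimal sketch of the typical output (the node ID and token below are placeholders; yours will differ):

Swarm initialized: current node (<node ID>) is now a manager.

To add a worker to this swarm, run the following command:

docker swarm join --token <token> 45.77.138.242:2377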

Within this output, a join command is given:

docker swarm join --token SWMTKN-1-0hqsajb64iynkg8ocp8uruktii5esuo4qiaxmqw2pddnkls9av-dfj7nf1x3vr5qcj4cqiusu4pv 45.77.138.242:2377

This command needs to be executed on each Slave node later, so make a note of it now.

After initialization is complete, you will get a Docker cluster with only one server. Execute the following command:

docker node ls

You can see the current status of the cluster, as shown in the following figure.
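For reference, with only the Manager joined, the output has roughly the following shape (the node ID here is an illustrative placeholder):

ID                           HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
abcd1234efgh5678ijkl90mn *   master     Ready    Active         Leader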

Create a private registry (optional)

Creating a private registry is not required. The reason you may need one is that the project's Docker image may involve company secrets and cannot be uploaded to a public platform such as DockerHub. If your image can be uploaded publicly to DockerHub, or if you already have a private image repository available, you can use that directly and skip this section and the next one.

The private registry is itself a Docker image. Pull it down first:

docker pull registry:latest

As shown in the figure below.

Now start the private registry (the registry image stores its data under /var/lib/registry, so that is the container path mounted here):

docker run -d -p 8003:5000 --name registry -v /tmp/registry:/var/lib/registry docker.io/registry:latest

As shown in the figure below.

In the startup command, the exposed port is set to 8003, so the address of the private registry is 45.77.138.242:8003.
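To confirm the registry is up, you can query its standard v2 catalog endpoint; a freshly started registry should return an empty repository list:

curl http://45.77.138.242:8003/v2/_catalog

The expected response is {"repositories":[]}.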

Hint:

A registry built this way uses HTTP and has no authentication mechanism, so if it is exposed to the public network, you need to use a firewall to set up an IP whitelist to keep the data secure.
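On Ubuntu, one way to do this is with ufw. For example, to allow only the two Slave servers to reach the registry port (a sketch, assuming ufw is installed and enabled and that no other rules conflict):

ufw allow from 199.247.30.74 to any port 8003
ufw allow from 95.179.143.21 to any port 8003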

Allow Docker to use a trusted HTTP private registry (optional)

If you built your own private registry with the command in the previous section, you need to configure Docker to trust it, because Docker refuses to use HTTP registries by default.

Configure Docker using the following command (this assumes /etc/docker/daemon.json does not already exist; if it does, merge the insecure-registries key into the existing JSON instead):

echo '{ "insecure-registries":["45.77.138.242:8003"] }' >> /etc/docker/daemon.json

Then restart Docker with the following command:

systemctl restart docker

As shown in the figure below.

After the restart is complete, the Manager node is configured.

Create a Slave node initialization script

For the Slave servers, only three things need to be done:

  • Install Docker
  • Join the cluster
  • Trust the private registry

From now on, all other tasks will be managed by Docker Swarm itself, and you will no longer need to log in to the server through SSH.

To simplify the operation, you can write a shell script and run it on each machine. Create an init.sh file on the Slave-1 and Slave-2 servers with the following content:

apt-get update
apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
apt-get update
apt-get install -y docker-ce
echo '{ "insecure-registries":["45.77.138.242:8003"] }' >> /etc/docker/daemon.json
systemctl restart docker 
docker swarm join --token SWMTKN-1-0hqsajb64iynkg8ocp8uruktii5esuo4qiaxmqw2pddnkls9av-dfj7nf1x3vr5qcj4cqiusu4pv 45.77.138.242:2377

Make this file executable, then run it:

chmod +x init.sh
./init.sh

As shown in the figure below.

After the script finishes, you can log out of SSH on Slave-1 and Slave-2; there will be no need to log in again.

Go back to the Master server and execute the following command to confirm that the cluster now has 3 nodes:

docker node ls

You can see that there are now 3 nodes in the cluster. As shown in the figure below.

So far, the most complicated and troublesome part is over. All that is left is to experience the convenience that Docker Swarm brings.

Creating a test program

Build and test Redis

Since we need to simulate the running of a distributed crawler, we first use Docker to start a temporary Redis service.

Execute the following command on the Master server:

docker run -d --name redis -p 7891:6379 redis --requirepass "KingnameISHandSome8877"

This Redis instance is exposed on port 7891, its password is KingnameISHandSome8877, and its IP is the Master server's IP address.

Writing a test program

Write a simple Python program:

import time
import redis


# connect to the Redis service started on the Master server
client = redis.Redis(host='45.77.138.242', port=7891, password='KingnameISHandSome8877')

while True:
    # pop one task from the list; exit once the list is empty
    data = client.lpop('example:swarm:spider')
    if not data:
        break
    print(f'The data I am getting now is: {data.decode()}')
    time.sleep(10)

This Python program pops a value from Redis every 10 seconds and prints it, and it exits once the list is empty.
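For the test to show anything, the example:swarm:spider list needs data in it. One way to seed it is with redis-cli inside the Redis container started earlier (a quick sketch; the numbers pushed are arbitrary):

docker exec -it redis redis-cli -a KingnameISHandSome8877 lpush example:swarm:spider 1 2 3 4 5 6 7 8 9 10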

Writing a Dockerfile

Write a Dockerfile to build our own image on top of the Python 3.6 image:

FROM python:3.6
LABEL maintainer='[email protected]'

USER root
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=utf-8

RUN python3 -m pip install redis

COPY spider.py spider.py
CMD python3 spider.py

Build the image

After writing the Dockerfile, execute the following command to start building our own image:

docker build -t localhost:8003/spider:0.01 .

It is important to note that, since this image will be uploaded to the private registry for the Slave nodes to download, the image name must follow the format localhost:8003/<custom name>:<version number>. The custom name and version number can be adjusted to your actual situation. In this article's example, I named it spider because it simulates a crawler program, and since this is the first build, the version number is 0.01.

The whole process is shown in the figure below.

Upload the image to a private repository

After the image is built, it needs to be uploaded to the private registry. At this point, execute the command:

docker push localhost:8003/spider:0.01

As shown in the figure below.

Remember these build and upload commands; you will need both every time you update the code from now on.
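Since the two commands always travel together, you may want to wrap them in a small helper script. This is a hypothetical convenience, not part of the original workflow; save it as, say, release.sh:

#!/bin/bash
# hypothetical helper: build and push one version of the spider image
# usage: ./release.sh 0.02
VERSION=$1
docker build -t localhost:8003/spider:$VERSION .
docker push localhost:8003/spider:$VERSION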

Creating a Service

In Docker Swarm, workloads run as services, so you need to use the docker service command to create one.

docker service create --name spider --network host 45.77.138.242:8003/spider:0.01

This command creates a service named spider. By default, it runs one container. The operation is shown in the figure below.

Of course, you can also run multiple containers from the start by adding the --replicas parameter. For example, to run the service with 50 containers the moment it is created:

docker service create --name spider --replicas 50 --network host 45.77.138.242:8003/spider:0.01

However, the initial code may well have bugs, so it is recommended to start with a single container, watch the logs, and scale up only after confirming there are no problems.

Back to the default case of one container: this container may end up on any of the three machines. Execute the following command to observe it:

docker service ps spider

As shown in the figure below.
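For reference, the output looks roughly like this (the node column and timestamps are illustrative):

ID             NAME       IMAGE                            NODE     DESIRED STATE   CURRENT STATE
rusps0ofwids   spider.1   45.77.138.242:8003/spider:0.01   master   Running         Running 2 minutes ago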

View Node Log

According to the execution results above, the ID of the running container is rusps0ofwids, so execute the following command to follow its log in real time:

docker service logs -f rusps0ofwids

The log of this container will now be tracked continuously. As shown in the figure below.

Horizontal Scaling

At the moment, only one server is running a container. To have all three servers run this crawler, only one command is needed:

docker service scale spider=3

The running effect is shown in the figure below.

At this point, check the crawler's running status again and you can find that a container is running on each of the three machines. As shown in the figure below.

Now, log in to the Slave-1 machine and check whether a task is actually running there. As shown in the figure below.

You can see that there is indeed a container running on it. This is automatically assigned by Docker Swarm.

Now, use the following command to forcibly shut down Docker on Slave-1 and watch what happens.

systemctl stop docker

Go back to the master server and check the running effect of the crawler again, as shown in the figure below.

As you can see, once Docker Swarm detects that Slave-1 has gone offline, it automatically finds a new machine on which to restart the task, ensuring that three tasks are always running. In this example, Docker Swarm started two spider containers on the Master machine.

If the machine performance is good, you can even run more containers on each machine:

docker service scale spider=10

At this point, 10 containers will be started to run these crawlers. These 10 crawlers are isolated from each other.

What if you want to stop all crawlers? Very simple, one command:

docker service scale spider=0

This will stop all crawlers.
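Note that scaling to 0 keeps the service defined, so you can scale it back up at any time. If you want to delete the service entirely, use:

docker service rm spider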

View logs of multiple containers simultaneously

What if you want to watch all containers at the same time? You can use the following command to view the latest 20 log lines of every container:

docker service ps spider | grep Running | awk '{print $1}' | xargs -i docker service logs --tail 20 {}

In this way, the logs will be displayed in order. As shown in the figure below.

Update crawler

If you make changes to your code, you need to update the crawler.

First modify the code, rebuild the image, and push the new image to the private registry. As shown in the figure below.

Next, you need to update the image used by the service. There are two ways to do this. One is to scale all crawlers down to zero first, then update:

docker service scale spider=0
docker service update --image 45.77.138.242:8003/spider:0.02 spider
docker service scale spider=3

The second is to execute the update command directly:

docker service update --image 45.77.138.242:8003/spider:0.02 spider

The difference between them is that with a direct update, the running containers are replaced one by one in a rolling fashion.

The running effect is shown in the figure below.
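If you want finer control over the rolling update, docker service update also accepts flags such as --update-parallelism (how many tasks are updated at once) and --update-delay (the pause between batches). For example:

docker service update --update-parallelism 5 --update-delay 10s --image 45.77.138.242:8003/spider:0.02 spider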

You can do more with Docker Swarm

This article used a simulated crawler as its example, but obviously, any program that can run in batches can run under Docker Swarm, whether it uses Redis or Celery to communicate and whether it needs to communicate at all. As long as it can run in batches, Docker Swarm can run it.

In the same Swarm cluster, you can run multiple different services without them affecting each other. You can truly build a Docker Swarm cluster once and then never worry about it again; all future operations only need to be run on the server hosting the Manager node.
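For example, assuming you have built and pushed an image for a second crawler (the spider-b name and image below are hypothetical), starting it alongside the first is just another docker service create:

docker service create --name spider-b --replicas 10 --network host 45.77.138.242:8003/spider-b:0.01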

The above is the full content of this article. I hope it will be helpful for everyone’s study. I also hope that everyone will support 123WORDPRESS.COM.
