Explain the deployment and configuration of Clickhouse Docker cluster with examples

Foreword

I finally found some time to update my big-data experience. When I was initially choosing the architecture, I considered Hadoop as the data warehouse, but Hadoop needs a lot of machines: a cluster realistically starts at six or more servers. So I chose ClickHouse (hereinafter CH). A CH cluster can start with just three servers, and even that is not mandatory; it mainly depends on whether your ZooKeeper is deployed as a cluster. CH also has very strong performance: when the data volume is not huge, a single machine can handle all kinds of OLAP scenarios.

Let’s get to the point. Related environment:

IP             Server name   OS            Services
172.192.13.10  server01      Ubuntu 20.04  Two ClickHouse instances, ZooKeeper
               CH instance 1 (main): tcp 9000, http 8123, interserver 9009, mysql 9004; primary of shard 1
               CH instance 2 (sub):  tcp 9001, http 8124, interserver 9010, mysql 9005; replica of server02 (shard 2)
172.192.13.11  server02      Ubuntu 20.04  Two ClickHouse instances, ZooKeeper
               CH instance 3 (main): tcp 9000, http 8123, interserver 9009, mysql 9004; primary of shard 2
               CH instance 4 (sub):  tcp 9001, http 8124, interserver 9010, mysql 9005; replica of server03 (shard 3)
172.192.13.12  server03      Ubuntu 20.04  Two ClickHouse instances, ZooKeeper
               CH instance 5 (main): tcp 9000, http 8123, interserver 9009, mysql 9004; primary of shard 3
               CH instance 6 (sub):  tcp 9001, http 8124, interserver 9010, mysql 9005; replica of server01 (shard 1)

Note that the sub instances listen on tcp 9001 rather than 9000: with the host network, two processes on the same machine cannot share a port (see the port changes below).

Install Docker on each server. Every server ends up running three containers: ch-main, ch-sub, and zookeeper_node.

Careful readers will notice that there are no mapping relationships in the PORTS column. That is because Docker's host network mode is used here: it is simple and performs well, and it avoids many container-to-container and cross-server communication problems that had troubled me for a long time.

Environment deployment

1. Server environment configuration

On each server, execute vim /etc/hosts and add the following entries:

172.192.13.10 server01
172.192.13.11 server02
172.192.13.12 server03
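
A quick optional check that name resolution now works between the machines:

ping -c 1 server02
ping -c 1 server03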

2. Install Docker

Trivial; omitted here.

3. Pull clickhouse and zookeeper images

Also trivial and omitted here.
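
For reference, a minimal sketch pulling the images used in the rest of this article (pin specific tags if you want reproducible deployments):

docker pull yandex/clickhouse-server
docker pull zookeeper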

Zookeeper Cluster Deployment

On each server, create a folder to hold the ZooKeeper configuration and data at a location of your choice (here /usr/soft/zookeeper/), then run the startup command on each server in turn:

Server01 executes:

docker run -d -p 2181:2181 -p 2888:2888 -p 3888:3888 --name zookeeper_node --restart always \
-v /usr/soft/zookeeper/data:/data \
-v /usr/soft/zookeeper/datalog:/datalog \
-v /usr/soft/zookeeper/logs:/logs \
-v /usr/soft/zookeeper/conf:/conf \
--network host \
-e ZOO_MY_ID=1 zookeeper

Server02 executes:

docker run -d -p 2181:2181 -p 2888:2888 -p 3888:3888 --name zookeeper_node --restart always \
-v /usr/soft/zookeeper/data:/data \
-v /usr/soft/zookeeper/datalog:/datalog \
-v /usr/soft/zookeeper/logs:/logs \
-v /usr/soft/zookeeper/conf:/conf \
--network host \
-e ZOO_MY_ID=2 zookeeper

Server03 executes:

docker run -d -p 2181:2181 -p 2888:2888 -p 3888:3888 --name zookeeper_node --restart always \
-v /usr/soft/zookeeper/data:/data \
-v /usr/soft/zookeeper/datalog:/datalog \
-v /usr/soft/zookeeper/logs:/logs \
-v /usr/soft/zookeeper/conf:/conf \
--network host \
-e ZOO_MY_ID=3 zookeeper

The only difference between the three commands is the -e ZOO_MY_ID value. (With --network host, the -p mappings are redundant: Docker ignores published ports in host network mode, and the services bind directly to the host's ports.)

Next, on each server open the /usr/soft/zookeeper/conf directory, find the zoo.cfg configuration file, and change it to:

dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=5
syncLimit=2
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=60

server.1=172.192.13.10:2888:3888
server.2=172.192.13.11:2888:3888
server.3=172.192.13.12:2888:3888
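
The official zookeeper image generates the myid file under /data from the ZOO_MY_ID variable, so with the volume mappings above you can verify each node from the host:

cat /usr/soft/zookeeper/data/myid
# should print 1 on server01, 2 on server02, 3 on server03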

Then enter the container on any one of the servers and check whether the cluster formed correctly:

docker exec -it zookeeper_node /bin/bash

./bin/zkServer.sh status 
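
If the cluster formed correctly, one node reports "Mode: leader" and the other two report "Mode: follower". On recent official images zkServer.sh is also on the PATH, so the same check can be run without entering the container:

docker exec zookeeper_node zkServer.sh status
# expect "Mode: follower" on two nodes and "Mode: leader" on the third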

Clickhouse cluster deployment

1. Copy the default configuration from a temporary container

Run a temporary container so that its default configuration, data and log layout can be copied out to the host:

docker run --rm -d --name=temp-ch yandex/clickhouse-server

Copy the configuration directory out of the container:

docker cp temp-ch:/etc/clickhouse-server/ /etc/
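
A quick check that the defaults landed on the host (config.xml and users.xml are the files we care about):

ls /etc/clickhouse-server/
# expect at least config.xml and users.xml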


2. Modify config.xml configuration

<!-- Listen on all addresses; keep only one listen_host entry and comment out the others (use "::" instead of "0.0.0.0" if you also need IPv6) -->
<listen_host>0.0.0.0</listen_host>

<!-- Set the time zone -->
<timezone>Asia/Shanghai</timezone>

<!-- Delete the sample shards under the original remote_servers node and replace it with an include reference -->
<remote_servers incl="clickhouse_remote_servers" />

<!-- Newly added, at the same level as the remote_servers node -->
<include_from>/etc/clickhouse-server/metrika.xml</include_from>
<zookeeper incl="zookeeper-servers" optional="true" />
<macros incl="macros" optional="true" />

3. Copy to another folder

mkdir -p /usr/soft/clickhouse-server/main /usr/soft/clickhouse-server/sub
cp -rf /etc/clickhouse-server/ /usr/soft/clickhouse-server/main/conf
cp -rf /etc/clickhouse-server/ /usr/soft/clickhouse-server/sub/conf

(The conf subfolder matches the volume mounts used when starting the containers later.)

main is the primary shard and sub is the replica.

4. Distribute to other servers

# Copy the configuration to server02 (scp needs -r for directories;
# create /usr/soft/clickhouse-server on the targets first)
scp -r /usr/soft/clickhouse-server/main server02:/usr/soft/clickhouse-server/
scp -r /usr/soft/clickhouse-server/sub server02:/usr/soft/clickhouse-server/
# Copy the configuration to server03
scp -r /usr/soft/clickhouse-server/main server03:/usr/soft/clickhouse-server/
scp -r /usr/soft/clickhouse-server/sub server03:/usr/soft/clickhouse-server/

scp really is convenient.

Then you can delete the temporary container: docker rm -f temp-ch

Configuring the cluster

There are three servers here, each running two CH instances that back each other up in a ring, giving high availability. When resources are sufficient, the sub replicas can be moved onto fully independent machines just by adjusting the configuration; this kind of easy horizontal scaling is another advantage of ClickHouse.

1. Modify the configuration

Enter the server01 machine, open /usr/soft/clickhouse-server/sub/conf/config.xml, and change the following:

Original:
<http_port>8123</http_port>
<tcp_port>9000</tcp_port>
<mysql_port>9004</mysql_port>
<interserver_http_port>9009</interserver_http_port>

Modified to:
<http_port>8124</http_port>
<tcp_port>9001</tcp_port>
<mysql_port>9005</mysql_port>
<interserver_http_port>9010</interserver_http_port>

The point of these changes is to distinguish the sub instance from the main instance: with the shared host network, one port cannot be bound by two processes at the same time. Make the same modification on server02 and server03, or distribute the edited file with scp.
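
If you prefer to script these edits, a hypothetical sed one-liner along these lines does the job (adjust the path per server, and back up config.xml first):

sed -i -e 's|<http_port>8123<|<http_port>8124<|' \
    -e 's|<tcp_port>9000<|<tcp_port>9001<|' \
    -e 's|<mysql_port>9004<|<mysql_port>9005<|' \
    -e 's|<interserver_http_port>9009<|<interserver_http_port>9010<|' \
    /usr/soft/clickhouse-server/sub/conf/config.xml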

2. Add cluster configuration file metrika.xml

server01, main primary shard configuration:

Go to the /usr/soft/clickhouse-server/main/conf folder and add a metrika.xml file (file encoding: UTF-8).

<yandex>
    <!-- CH cluster configuration, all servers are the same-->
    <clickhouse_remote_servers>
        <cluster_3s_1r>
            <!-- Data Shard 1 -->
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>server01</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                </replica>
                <replica>
                    <host>server03</host>
                    <port>9001</port>
                    <user>default</user>
                    <password></password>
                </replica>
            </shard>
            <!-- Data Shard 2 -->
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>server02</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                </replica>
                <replica>
                    <host>server01</host>
                    <port>9001</port>
                    <user>default</user>
                    <password></password>
                </replica>
            </shard>
            <!-- Data Shard 3 -->
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>server03</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                </replica>
                <replica>
                    <host>server02</host>
                    <port>9001</port>
                    <user>default</user>
                    <password></password>
                </replica>
            </shard>
        </cluster_3s_1r>
    </clickhouse_remote_servers>

    <!-- All zookeeper_servers instances have the same configuration -->
    <zookeeper-servers>
        <node index="1">
            <host>172.192.13.10</host>
            <port>2181</port>
        </node>
        <node index="2">
            <host>172.192.13.11</host>
            <port>2181</port>
        </node>
        <node index="3">
            <host>172.192.13.12</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>

    <!-- The macros configuration differs for every instance -->
    <macros>
        <layer>01</layer>
        <shard>01</shard>
        <replica>cluster01-01-1</replica>
    </macros>
    <networks>
        <ip>::/0</ip>
    </networks>

    <!-- Data compression algorithm-->
    <clickhouse_compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method>
        </case>
    </clickhouse_compression>
</yandex>

The <macros> node is different for every server and every instance; all other nodes can be identical. Only the <macros> differences are listed below.

Server01, sub replica configuration:

<macros>
    <layer>01</layer>
    <shard>02</shard>
    <replica>cluster01-02-2</replica>
</macros>

Server02, main primary shard configuration:

<macros>
    <layer>01</layer>
    <shard>02</shard>
    <replica>cluster01-02-1</replica>
</macros>

Server02, sub replica configuration:

<macros>
    <layer>01</layer>
    <shard>03</shard>
    <replica>cluster01-03-2</replica>
</macros>

Server03, main primary shard configuration:

<macros>
    <layer>01</layer>
    <shard>03</shard>
    <replica>cluster01-03-1</replica>
</macros>

Server03, sub replica configuration:

<macros>
    <layer>01</layer>
    <shard>01</shard>
    <replica>cluster01-01-2</replica>
</macros>

At this point, all configurations have been completed. Other configurations, such as passwords, can be added as needed.

Cluster operation and testing

Start the instances on each server in turn. ZooKeeper should already be running from the earlier step; if not, start the zk cluster first.

Run the main instance:

docker run -d --name=ch-main -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 \
-v /usr/soft/clickhouse-server/main/data:/var/lib/clickhouse:rw \
-v /usr/soft/clickhouse-server/main/conf:/etc/clickhouse-server:rw \
-v /usr/soft/clickhouse-server/main/log:/var/log/clickhouse-server:rw \
--add-host server01:172.192.13.10 \
--add-host server02:172.192.13.11 \
--add-host server03:172.192.13.12 \
--hostname server01 \
--network host \
--restart=always \
 yandex/clickhouse-server

Run the sub instance:

docker run -d --name=ch-sub -p 8124:8124 -p 9001:9001 -p 9010:9010 --ulimit nofile=262144:262144 \
-v /usr/soft/clickhouse-server/sub/data:/var/lib/clickhouse:rw \
-v /usr/soft/clickhouse-server/sub/conf:/etc/clickhouse-server:rw \
-v /usr/soft/clickhouse-server/sub/log:/var/log/clickhouse-server:rw \
--add-host server01:172.192.13.10 \
--add-host server02:172.192.13.11 \
--add-host server03:172.192.13.12 \
--hostname server01 \
--network host \
--restart=always \
 yandex/clickhouse-server
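
Once the containers are up, ClickHouse's HTTP /ping endpoint gives a quick liveness check from the host:

curl http://localhost:8123/ping   # main instance, expect "Ok."
curl http://localhost:8124/ping   # sub instance, expect "Ok."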

When running the command on each server, the only parameter that changes is --hostname, because we rely on the hostname to identify each server. Without it, select * from system.clusters would show 0 in the is_local column for every row, meaning the instance cannot find itself in the cluster, which is something to watch out for. After all instances are started, connect with a client such as DataGrip:
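
For example, a sanity check of the topology straight from the shell (a sketch using clickhouse-client inside the ch-main container):

docker exec -it ch-main clickhouse-client --query "
    SELECT cluster, shard_num, replica_num, host_name, port, is_local
    FROM system.clusters
    WHERE cluster = 'cluster_3s_1r'"
# expect 6 rows (3 shards x 2 replicas); exactly one row has is_local = 1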

Create a new query on any instance:

create table T_UserTest on cluster cluster_3s_1r
(
    ts DateTime,
    uid String,
    biz String
)
    engine = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/T_UserTest', '{replica}')
        PARTITION BY toYYYYMMDD(ts)
        ORDER BY ts
        SETTINGS index_granularity = 8192;

cluster_3s_1r is the cluster name configured earlier; the two must correspond exactly. /clickhouse/tables/ is the conventional ZooKeeper path prefix. For the full syntax, see the official documentation.

Refresh each instance and you can see that all of them now have the T_UserTest table; with ZooKeeper set up, distributed DDL is easy.

Continue to create a new Distributed table:

CREATE TABLE T_UserTest_All ON CLUSTER cluster_3s_1r AS T_UserTest ENGINE = Distributed(cluster_3s_1r, default, T_UserTest, rand());

Each primary shard inserts relevant information separately:

-- server01
insert into T_UserTest values ('2021-08-16 17:00:00', '1', '1');
-- server02
insert into T_UserTest values ('2021-08-16 17:00:00', '2', '1');
-- server03
insert into T_UserTest values ('2021-08-16 17:00:00', '3', '1');

Then query the distributed table with select * from T_UserTest_All; all three rows should come back.
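
To confirm the rows really landed on different shards, compare the distributed count against a local one (a sketch; runnable on any of the servers):

docker exec -it ch-main clickhouse-client --query "SELECT count() FROM T_UserTest_All"   # expect 3
docker exec -it ch-main clickhouse-client --query "SELECT count() FROM T_UserTest"       # expect 1 (the local shard only)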

Querying the corresponding replica tables, or shutting down the Docker instances on one of the servers, should not affect the query; I have not tested this here due to time constraints.

This concludes this article on ClickHouse Docker cluster configuration and deployment.

