Detailed tutorial on deploying Hadoop cluster using Docker

Recently I needed to set up a Hadoop test cluster at my company, so I used Docker to deploy it quickly.

0. Preface

There are already many tutorials online, but they are full of pitfalls, so here I am recording my own installation process.

Objective: Use Docker to build a Hadoop 2.7.7 cluster with one master and two slaves.

Preparation:

First, you need a CentOS 7 machine with more than 8 GB of memory; I used an Alibaba Cloud host.

Second, upload the JDK and Hadoop packages to the server.

I installed Hadoop 2.7.7 and have prepared the packages for you. Link: https://pan.baidu.com/s/15n_W-1rqOd2cUzhfvbkH4g, extraction code: vmzw.
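
If the network disk link does not work for you, Hadoop 2.7.7 can also be fetched from the Apache release archive (the JDK has to be downloaded from Oracle), for example:

# download the Hadoop 2.7.7 binary tarball from the Apache archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz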

1. Steps

It can be roughly divided into the following steps:

  • Install Docker
  • Basic environment preparation
  • Configure the network and start the docker container
  • Configure host and ssh password-free login
  • Install and configure hadoop

1.1 Install Docker

Follow the steps below to install Docker. If you already have a Docker environment, you can skip this step.

yum update

yum install -y yum-utils device-mapper-persistent-data lvm2

yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

yum install -y docker-ce
 
systemctl start docker

docker -v
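
If you also want Docker to start automatically after a reboot, you can enable the service:

# start Docker on boot
systemctl enable docker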

1.2 Basic Environment Preparation

1.2.1 Create a basic CentOS 7 image

Pull the official centos image:

docker pull centos

Generate a CentOS image with SSH support by building a Dockerfile.

Create a Dockerfile

vi Dockerfile

Write the following content into Dockerfile

FROM centos
MAINTAINER mwf

RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
RUN yum install -y openssh-clients

RUN echo "root:qwe123" | chpasswd
RUN echo "root ALL=(ALL) ALL" >> /etc/sudoers
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key

RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

The above roughly means: based on the centos image, set the root password to qwe123, install the SSH service, and start it.

Build the image from the Dockerfile:

docker build -t="centos7-ssh" .

An image named centos7-ssh will be generated; you can check it with docker images.
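
For example, to confirm that the build succeeded:

# list local images and filter for the one we just built
docker images | grep centos7-ssh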

1.2.2 Generate an image with hadoop and jdk environment

  • Put the prepared packages, hadoop-2.7.7.tar.gz and jdk-8u202-linux-x64.tar.gz, in the current directory.
  • Generate a CentOS image with the Hadoop and JDK environment by building a Dockerfile.

We have just created a Dockerfile, so move it out of the way first: mv Dockerfile Dockerfile.bak

Create Dockerfile

vi Dockerfile

Write the following:

FROM centos7-ssh
ADD jdk-8u202-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_202 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH

ADD hadoop-2.7.7.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.7 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH

RUN yum install -y which sudo

The above roughly means: based on the centos7-ssh image built above, unpack the Hadoop and JDK packages into it and configure the environment variables.

Build the image from the Dockerfile:

docker build -t="hadoop" .

An image named hadoop will be generated.

1.3 Configure the network and start the docker container

Because the cluster nodes must communicate with each other over the network, configure the network first.

Create a network:

docker network create --driver bridge hadoop-br

The above command creates a bridge network named hadoop-br.

Specify the network when starting the containers:

docker run -itd --network hadoop-br --name hadoop1 -p 50070:50070 -p 8088:8088 hadoop
docker run -itd --network hadoop-br --name hadoop2 hadoop
docker run -itd --network hadoop-br --name hadoop3 hadoop

The above commands start three containers on the hadoop-br network; port mapping (50070 and 8088) is enabled only for hadoop1.
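
You can verify that the three containers are running and that hadoop1's ports are mapped:

# list running containers and their port mappings
docker ps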

Check network status

docker network inspect hadoop-br

Execute the above command to see the corresponding network information:

[
    {
        "Name": "hadoop-br",
        "Id": "88b7839f412a140462b87a353769e8091e92b5451c47b5c6e7b44a1879bc7c9a",
        "Containers": {
            "86e52eb15351114d45fdad4462cc2050c05202554849bedb8702822945268631": {
                "Name": "hadoop1",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            },
            "9baa1ff183f557f180da2b7af8366759a0d70834f43d6b60fba2e64f340e0558": {
                "Name": "hadoop2",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "e18a3166e965a81d28b4fe5168d1f0c3df1cb9f7e0cbe0673864779b224c8a7f": {
                "Name": "hadoop3",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            }
        }
    }
]

We can find out the IP addresses of the three machines:

172.18.0.2 hadoop1 
172.18.0.3 hadoop2 
172.18.0.4 hadoop3

Log in to the containers; they should be able to ping each other (a quick check is shown after the commands below).

docker exec -it hadoop1 bash
docker exec -it hadoop2 bash
docker exec -it hadoop3 bash
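
For example, from inside hadoop1 (use the IP addresses reported by docker network inspect; they may differ on your machine):

# inside the hadoop1 container
ping -c 3 172.18.0.3
ping -c 3 172.18.0.4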

1.4 Configure host and ssh password-free login

1.4.1 Configuring the host

Modify the hosts file on each machine:

vi /etc/hosts

Write the following content (note: the IPs allocated by Docker may differ for you; fill in your own):

172.18.0.2 hadoop1 
172.18.0.3 hadoop2 
172.18.0.4 hadoop3

1.4.2 SSH password-free login

Because the SSH service was already installed in the image, execute the following commands directly on each machine:

ssh-keygen
(press Enter through all the prompts)

ssh-copy-id -i /root/.ssh/id_rsa -p 22 root@hadoop1
(enter the root password; mine is qwe123)

ssh-copy-id -i /root/.ssh/id_rsa -p 22 root@hadoop2
(enter the password)

ssh-copy-id -i /root/.ssh/id_rsa -p 22 root@hadoop3
(enter the password)

1.4.3 Test whether the configuration is successful

ping hadoop1 
ping hadoop2
ping hadoop3
ssh hadoop1
ssh hadoop2
ssh hadoop3

1.5 Install and configure Hadoop

1.5.1 Operation on hadoop1

Enter hadoop1

docker exec -it hadoop1 bash

Create some directories that will be referenced in the configuration later:

mkdir /home/hadoop
mkdir /home/hadoop/tmp /home/hadoop/hdfs_name /home/hadoop/hdfs_data

Switch to the hadoop configuration directory

cd $HADOOP_HOME/etc/hadoop/

Edit core-site.xml and add the following properties inside the <configuration> tag:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131702</value>
  </property>

Edit hdfs-site.xml and add the following inside <configuration>:

 <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hdfs_name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/hdfs_data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop1:9001</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>

Edit mapred-site.xml

mapred-site.xml does not exist by default, so create it first with cp mapred-site.xml.template mapred-site.xml, then add the following inside <configuration>:

 <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop1:19888</value>
  </property>

Edit yarn-site.xml and add the following inside <configuration>:

 <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop1:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop1:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop1:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop1:8088</value>
  </property>

Edit slaves

Here I use hadoop1 as the master node and hadoop2 and hadoop3 as slave nodes, so the slaves file only lists the slaves:

hadoop2
hadoop3

Copy the files to hadoop2 and hadoop3

Execute the following commands in sequence:

scp -r $HADOOP_HOME hadoop2:/usr/local/
scp -r $HADOOP_HOME hadoop3:/usr/local/

scp -r /home/hadoop hadoop2:/home/
scp -r /home/hadoop hadoop3:/home/

1.5.2 Operation on each machine

Connect to each machine separately

docker exec -it hadoop1 bash
docker exec -it hadoop2 bash
docker exec -it hadoop3 bash

Configure the environment variables of the hadoop sbin directory

Because only the Hadoop bin directory was added to PATH when the image was built (the sbin directory was not), it must be configured separately. Do this on each machine:

vi ~/.bashrc

Append the following content:

export PATH=$PATH:$HADOOP_HOME/sbin

Then apply it:

source ~/.bashrc
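
One possible pitfall: if start-all.sh later complains that JAVA_HOME is not set on the other nodes (environment variables defined in the Dockerfile are not passed into ssh sessions), you can hard-code it in hadoop-env.sh on every machine, for example:

# append an explicit JAVA_HOME so daemons started over ssh can find Java
echo 'export JAVA_HOME=/usr/local/jdk1.8' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh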

1.5.3 Start Hadoop

Execute the following command on hadoop1:

Format HDFS:

hdfs namenode -format

Start everything with one command:

start-all.sh

If nothing reports an error, you can celebrate. If something does go wrong, don't give up.
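
When something fails, the daemon logs are the first place to look. A typical way to recover is to stop everything, inspect the logs, fix the configuration, and start again:

# stop all daemons, check the logs, then retry
stop-all.sh
ls $HADOOP_HOME/logs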

1.6 Testing Hadoop

Run jps on each machine to check the running daemons:

jps

#hadoop1
1748 Jps
490 NameNode
846 ResourceManager
686 SecondaryNameNode

#hadoop2
400 DataNode
721 Jps
509 NodeManager

#hadoop3
425 NodeManager
316 DataNode
591 Jps

Upload files

hdfs dfs -mkdir /mwf

echo hello > a.txt
hdfs dfs -put a.txt /mwf

hdfs dfs -ls /mwf

Found 1 items
drwxr-xr-x - root supergroup 0 2020-09-04 11:14 /mwf
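
If you also want to exercise YARN and MapReduce, the examples jar that ships with Hadoop can be run against the file we just uploaded (the jar path below assumes the standard 2.7.7 binary distribution):

# run the bundled wordcount example and print the result
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /mwf /mwf-out
hdfs dfs -cat /mwf-out/part-r-00000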

Since this is a cloud server and I don't want to open up the ports, I won't look at the web UI.
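
That said, because ports 50070 and 8088 were mapped when hadoop1 was started, you can at least confirm from the host itself that the web UIs respond, for example:

# run on the Docker host; the NameNode and ResourceManager UIs should answer
curl -I http://localhost:50070
curl -I http://localhost:8088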

2. Finally

The above is the process I wrote down after a successful installation. Following it should not cause problems, though there may be omissions.

3. References

https://cloud.tencent.com/developer/article/1084166

https://cloud.tencent.com/developer/article/1084157?from=10680

https://blog.csdn.net/ifenggege/article/details/108396249
