1. Environment Preparation

OS: CentOS Linux release 7.5.1804 (Core)

Create the working folders:

$ cd /home/centos
$ mkdir software
$ mkdir module

Import the installation packages into the software folder:

$ cd software
# Then drag the files in

The installation packages used here are:

/home/centos/software/hadoop-3.1.3.tar.gz
/home/centos/software/jdk-8u212-linux-x64.tar.gz

Extract both archives into the module folder:

$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C ../module
$ tar -zxvf hadoop-3.1.3.tar.gz -C ../module

Configure the environment variables:

$ cd /etc/profile.d/
$ vim my_env.sh

To avoid polluting the system variables, we create our own environment variable script. The content is as follows:

#JAVA_HOME, PATH
# export promotes the variable to a global variable.
# If your paths differ from mine, remember to use your own paths here.
export JAVA_HOME=/home/centos/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin

#HADOOP_HOME
export HADOOP_HOME=/home/centos/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Then save and exit (if you are not familiar with vim, read up on its basic usage first; I won't go into details here). Source the profile so the environment variables take effect:

$ source /etc/profile

Test whether it succeeded:

$ hadoop version
$ java -version

If both commands print version information, everything is fine. If it still does not work, check that the paths in my_env.sh match your actual install directories, then source the profile again.
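The PATH additions above can be sanity-checked directly from the shell. A minimal sketch, using the JDK and Hadoop paths from this guide (substitute your own):

```shell
# Re-create the exports from my_env.sh, then confirm each bin directory
# actually ended up on PATH. Paths are this guide's; adjust to your install.
export JAVA_HOME=/home/centos/module/jdk1.8.0_212
export HADOOP_HOME=/home/centos/module/hadoop-3.1.3
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

for dir in "$JAVA_HOME/bin" "$HADOOP_HOME/bin" "$HADOOP_HOME/sbin"; do
  case ":$PATH:" in
    *":$dir:"*) echo "on PATH: $dir" ;;
    *)          echo "MISSING: $dir" ;;
  esac
done
```

If any directory prints as MISSING, the corresponding export in my_env.sh was not picked up by `source /etc/profile`.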
Passwordless SSH

Although this is a pseudo-cluster, a password is still required when the local machine connects to itself, so set up passwordless SSH:

$ ssh-keygen -t rsa

Just keep pressing Enter at the prompts. After the key pair is generated:

$ ssh-copy-id <local hostname>

Configure the hosts file:

$ vi /etc/hosts

# The configuration I keep here maps master to the Tencent Cloud intranet IP.
# If the external IP is configured instead, the Eclipse client will not be able to connect to Hadoop.
::1         localhost.localdomain localhost
::1         localhost6.localdomain6 localhost6
172.16.0.3  master
127.0.0.1   localhost

Modify the hostname:

$ vi /etc/sysconfig/network
# Change HOSTNAME to master
HOSTNAME=master

$ hostnamectl --static set-hostname master

Turn off the firewall:

$ systemctl disable firewalld # Permanent

2. Configure Hadoop Configuration Files

Enter the Hadoop configuration directory; all the configuration files live in this folder:

$ cd /home/centos/module/hadoop-3.1.3/etc/hadoop

The files we need to configure are mainly:

core-site.xml
hdfs-site.xml
hadoop-env.sh
yarn-site.xml
mapred-site.xml
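Before moving on to the config files: what the ssh-keygen and ssh-copy-id steps above achieve for the local machine can be sketched directly. The scratch directory below is only so the sketch is safe to run anywhere; on the real host the files live in ~/.ssh:

```shell
# Sketch of passwordless-SSH setup for the local machine (what
# ssh-keygen + ssh-copy-id do). A scratch dir is used so running this
# cannot clobber an existing key; use ~/.ssh on the real host.
dir=$(mktemp -d)
ssh-keygen -t rsa -N "" -f "$dir/id_rsa" -q        # empty passphrase, no prompts
cat "$dir/id_rsa.pub" >> "$dir/authorized_keys"    # what ssh-copy-id appends remotely
chmod 700 "$dir" && chmod 600 "$dir/authorized_keys"
echo "public key entries installed: $(wc -l < "$dir/authorized_keys")"
```

With the public key present in the target account's ~/.ssh/authorized_keys, `ssh master` should no longer prompt for a password.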
Then just follow the steps!

$ vim core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Tencent Cloud intranet IP:9820</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/centos/module/hadoop-3.1.3/data/tmp</value>
  </property>
  <!-- Permission to operate HDFS through the web interface -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
  </property>
  <!-- For Hive compatibility later -->
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
</configuration>

$ vim hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Tencent Cloud intranet IP:9868</value>
  </property>
</configuration>

$ vim hadoop-env.sh

export JAVA_HOME=/home/centos/module/jdk1.8.0_212

$ vim yarn-site.xml

<configuration>
  <!-- How the Reducer obtains data -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Address of YARN's ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <!-- Environment variables inherited by the NodeManagers' containers. For MapReduce
       applications, HADOOP_MAPRED_HOME must be added in addition to the defaults. -->
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <!-- Disable the memory checks so YARN does not kill the Container when a
       program exceeds the virtual memory limit -->
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <!-- For Hive compatibility later -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <!-- Enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Log server access path -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://172.17.0.13:19888/jobhistory/logs</value>
  </property>
  <!-- Keep logs for 7 days -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>

Configure the history server:

$ vim mapred-site.xml

<configuration>
  <!-- History server address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>Tencent Cloud intranet IP:10020</value>
  </property>
  <!-- History server web address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>Tencent Cloud intranet IP:19888</value>
  </property>
</configuration>

Initialization

The NameNode needs to be formatted before the first startup, but not afterwards:

$ hdfs namenode -format

After initialization, two folders, data and logs, appear in the Hadoop installation folder, which means the format succeeded. Next, let's start the cluster.
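A malformed edit to any of the *-site.xml files above is a common cause of a failed format or startup, so it is worth confirming the files parse as XML first. A hedged sketch (it shells out to python3 for the parse, and demonstrates on a stand-in file; point CONF_DIR at $HADOOP_HOME/etc/hadoop on the real host):

```shell
# Verify every *-site.xml parses as XML before formatting the NameNode.
# CONF_DIR is a stand-in created here for illustration; on the real host
# set CONF_DIR=$HADOOP_HOME/etc/hadoop and skip the heredoc.
CONF_DIR=$(mktemp -d)
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9820</value>
  </property>
</configuration>
EOF

for f in "$CONF_DIR"/*-site.xml; do
  if python3 -c 'import sys, xml.dom.minidom as m; m.parse(sys.argv[1])' "$f" 2>/dev/null; then
    echo "OK: $f"
  else
    echo "BROKEN XML: $f"
  fi
done
```

Any file reported as BROKEN XML should be fixed before running `hdfs namenode -format`.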
$ start-dfs.sh

Startup completed with no error messages. Check the processes:

[root@VM_0_13_centos hadoop]# jps
20032 Jps
30900 DataNode
31355 SecondaryNameNode
30559 NameNode

All started successfully!

One-click start

If all of the above is OK, you can create a script to start the cluster with one command. Create a new file in the bin directory:

$ vim mycluster

Add the following content:

#!/bin/bash
case $1 in
"start")
    # dfs, yarn, history server
    start-dfs.sh
    start-yarn.sh
    mapred --daemon start historyserver
;;
"stop")
    # dfs, yarn, history server
    stop-dfs.sh
    stop-yarn.sh
    mapred --daemon stop historyserver
;;
*)
    echo "args is error! please input start or stop"
;;
esac

Make the script executable:

$ chmod u+x mycluster

Start the cluster using the script:

$ mycluster start
$ jps
23680 NodeManager
24129 JobHistoryServer
22417 DataNode
24420 Jps
22023 NameNode
23384 ResourceManager
22891 SecondaryNameNode

3. View HDFS

Configure security group rules

Before performing the following operations, open the ports this guide uses in the security group rules: at minimum the ones configured above (9820, 9868, 10020, 19888), plus the NameNode web UI port (9870 by default in Hadoop 3) and the YARN web UI port (8088).
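The jps output above can also be checked programmatically after `mycluster start`. A small sketch (check_daemons is a hypothetical helper, demonstrated against a copy of the sample output; on the real host you would pass it "$(jps)"):

```shell
# check_daemons: given jps-style output, report any missing Hadoop daemon.
check_daemons() {
  expected="NameNode DataNode SecondaryNameNode ResourceManager NodeManager JobHistoryServer"
  for d in $expected; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
    echo "$1" | grep -qw "$d" || { echo "missing: $d"; return 1; }
  done
  echo "all daemons running"
}

# Sample copied from the jps listing above; on the real host use: check_daemons "$(jps)"
sample="23680 NodeManager
24129 JobHistoryServer
22417 DataNode
22023 NameNode
23384 ResourceManager
22891 SecondaryNameNode"
check_daemons "$sample"   # prints: all daemons running
```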
Hadoop web pages

Enter the corresponding address and port in the browser. We found that the SecondaryNameNode page did not display properly. This is due to incorrect use of the time function in dfs-dust.js in Hadoop 3; let's correct it manually.

First shut down the cluster:

$ mycluster stop

Modify the file:

$ vim /home/centos/module/hadoop-3.1.3/share/hadoop/hdfs/webapps/static/dfs-dust.js

At about line 61, change the date formatting line to:

return new Date(Number(v)).toLocaleString();

Now restart the cluster:

$ mycluster start

You can see that the SecondaryNameNode web interface now displays correctly.

Testing HDFS

Let's upload a file and have some fun. Create a new folder in the Hadoop directory:

$ mkdir temdatas

Enter the folder and create a test file:

$ vim text.txt

Write whatever you want, save it, and start uploading:

$ hdfs dfs -put text.txt /

Check the web page; the upload succeeded. Try downloading the file as well:

$ hdfs dfs -get /text.txt ./text1.txt

Success!

WordCount example

Create a new folder input on the web UI, then upload a file of assorted words to run word statistics on:

# Or you can create the file in vim and upload it yourself
$ hdfs dfs -put wordcount.txt /input

Then run the wordcount example. Note that the output folder must not already exist:

$ hadoop jar /home/centos/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

After the job finishes, look at the results:

# Pull the output directory from hdfs
[root@master mydata]# hdfs dfs -get /output ./
# View the results
[root@master output]# cat part-r-00000
a	2
b	3
c	2
d	1
e	1
f	1

At this point, you can play around with Hadoop freely. Of course, if you have tried it, you will find one small remaining problem: clicking a file on the web UI to view its head or tail fails, and downloading through the UI also fails. This did not happen on a local virtual machine install, and I am still investigating the cause.
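What the wordcount example computes can be reproduced in plain shell, which is a handy way to double-check the job output. A sketch using a small made-up input that happens to yield the same counts as the part-r-00000 listing above:

```shell
# Word frequencies with coreutils: split on whitespace, sort, count.
# The input file is made up for illustration; its counts match the
# part-r-00000 output shown above (a 2, b 3, c 2, d 1, e 1, f 1).
printf 'a b b c\nb c d e\na f\n' > wordcount.txt
tr -s ' \t' '\n' < wordcount.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
```

Unlike the MapReduce job, this runs on a single machine, but the word/count pairs it prints should agree with the job's reducer output for the same input.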
If anyone knows what's going on, please leave a message. This is the end of this article on how to build a Hadoop 3.x pseudo-cluster on Tencent Cloud. For more related articles, please search 123WORDPRESS.COM.