Hbase Getting Started

1. HBase Overview

1.1 What is HBase

HBase is a NoSQL database built on top of HDFS for distributed data storage. It offers high reliability, high performance, column-oriented storage, scalability, and real-time reads and writes.

HBase can store massive amounts of data while maintaining high query performance: even with hundreds of millions of rows, queries can return results within seconds.

1.2 Characteristics of HBase Tables

1. Large

  • Hbase tables can store massive amounts of data.

2. Schema-free

  • In a MySQL table, every row has the same columns and fields, whereas each row in an HBase table can have a completely different set of columns.

3. Column-oriented

  • The data in an hbase table can have many columns; data is physically stored column by column, with different column families written to different files.
  • Data is stored in units of column families.

4. Sparse

  • Columns that are null in an HBase table do not occupy actual storage space.

5. Multiple versions of data

  • When data in an hbase table is updated, the previous values are not deleted immediately; instead, multiple versions are retained. Each value is given a version number, determined by the timestamp at the time of insertion.

6. Single data type

  • No matter what its original type, every piece of data is ultimately converted into a byte array before being stored in the hbase table (see the sketch below).
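
For illustration, here is a minimal Java sketch of that conversion using the HBase client's Bytes utility (the sample values are hypothetical):

import org.apache.hadoop.hbase.util.Bytes;

public class BytesDemo {
    public static void main(String[] args) {
        // Every value, whatever its original type, becomes a byte array before storage
        byte[] fromString = Bytes.toBytes("zhangsan");
        byte[] fromInt = Bytes.toBytes(30);
        byte[] fromLong = Bytes.toBytes(1544243300660L);

        // Reading data back means converting the bytes to the expected type yourself
        String name = Bytes.toString(fromString);
        int age = Bytes.toInt(fromInt);
        System.out.println(name + " " + age);
    }
}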

1.3 Logical view of an hbase table

2. HBase cluster structure

1. Client

  • Provides Java interfaces (APIs) for operating on HBase tables.
  • The client maintains caches to speed up access to hbase.
  • The client caches the region location information it has looked up, and this cache does not expire automatically. A minimal connection sketch follows.
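
As a rough sketch of obtaining such a client in Java (the zk addresses are the ones configured later in this article; the HBase 1.x client API is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClientDemo {
    public static void main(String[] args) throws Exception {
        // The client only needs the zk quorum; region locations are looked up
        // through zk and then cached on the client side
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}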

2. Zookeeper

The client needs a zk cluster in order to operate on hbase table data.

Roles:

1. zk saves the metadata of the hbase cluster

It stores the HBase schema, including which tables exist and which column families each table has.

2. zk saves the addressing entry for all hbase tables

When the client interface is used to operate on HBase data, it must connect to the zk cluster to look up the addressing entry of all regions, i.e., which server the root table is on.

3. With zk introduced, the entire hbase cluster becomes highly available

4. zk saves the registration and heartbeat information of HMaster and HRegionServer

If an HRegionServer fails, ZooKeeper detects the failure and notifies the active HMaster.

3. HMaster

It is the master node of the entire hbase cluster.

Roles:

1. It accepts client requests to create and delete tables, and handles schema update requests.

2. It assigns regions to HRegionServers to manage.

3. It reassigns the regions managed by a failed HRegionServer to other live HRegionServers.

4. It balances load across HRegionServers so that no single HRegionServer manages too many regions.

4. HRegionServer

It is the worker node of the entire hbase cluster.

Roles:

1. It manages the regions assigned to it by the HMaster.

2. It accepts the client's read and write requests.

3. It splits regions that grow too large during operation.

5. Region

It is the smallest unit of distributed storage in the entire HBase table.

Its data is stored on HDFS.

3. HBase cluster installation and deployment

Prerequisites

  • First build the zk and hadoop clusters

1. Download the corresponding installation package

  • http://archive.apache.org/dist/hbase/1.2.1/hbase-1.2.1-bin.tar.gz
  • hbase-1.2.1-bin.tar.gz

2. Plan the installation directory

  • /export/servers

3. Upload the installation package to the server

4. Unzip the installation package to the specified planning directory

  • tar -zxvf hbase-1.2.1-bin.tar.gz -C /export/servers

5. Rename the decompression directory

  • mv hbase-1.2.1 hbase

6. Modify the configuration file

The following two Hadoop configuration files are located in the etc/hadoop folder under the Hadoop installation directory:

  • core-site.xml
  • hdfs-site.xml

Copy both files into the conf folder under the hbase installation directory.

1. vim hbase-env.sh

#Configure java environment variables
export JAVA_HOME=/export/servers/jdk
#Let an external zk cluster manage hbase instead of the built-in zk cluster
export HBASE_MANAGES_ZK=false

2. vim hbase-site.xml

    <!-- Specify the path where hbase is stored on HDFS -->
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://node1:9000/hbase</value>
    </property>
    <!-- Specify that hbase is distributed -->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <!-- Specify the zk addresses, separating multiple addresses with "," -->
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>node1:2181,node2:2181,node3:2181</value>
    </property>

3. vim regionservers

#Specify which nodes are HRegionServers
node2
node3

4. vim backup-masters

#Specify which nodes are standby HMasters
node2

7. Configure hbase environment variables

vim /etc/profile

export HBASE_HOME=/export/servers/hbase
export PATH=$PATH:$HBASE_HOME/bin

8. Distribute hbase directories and environment variables

scp -r hbase node2:/export/servers
scp -r hbase node3:/export/servers
scp /etc/profile node2:/etc
scp /etc/profile node3:/etc

9. Make the environment variables of all hbase nodes effective

Execute on all nodes

  • source /etc/profile

4. Start and stop the hbase cluster

1. Start the hbase cluster

Start the zk and hadoop clusters first

Then run the following script from hbase/bin:

start-hbase.sh

  • On the node where you run this script, an HMaster process is started first (this becomes the active HMaster)
  • HRegionServers are started on the nodes listed in the regionservers file
  • Standby HMasters are started on the nodes listed in the backup-masters file

2. Stop the hbase cluster

Run the following from hbase/bin:

stop-hbase.sh

hbase cluster web management interface

1. After starting the hbase cluster

Access address

http://<HMaster hostname>:16010

5. Hbase shell command line operation

Run hbase/bin/hbase shell to enter the hbase shell client for command-line operations.

1. Create a table

create 't_user_info','base_info','extra_info'
create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

2. Check which tables are there

list
Similar to the SQL statement show tables in MySQL

3. View the description information of the table

describe 't_user_info'

4. Modify table properties

#Modify the maximum number of versions of the column family
alter 't_user_info', NAME => 'base_info', VERSIONS => 3

5. Add data to the table

put 't_user_info','00001','base_info:name','zhangsan'
put 't_user_info','00001','base_info:age','30'
put 't_user_info','00001','base_info:address','beijing'
put 't_user_info','00001','extra_info:school','shanghai'
put 't_user_info','00002','base_info:name','lisi'
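
The equivalent insert through the Java client API might look like the following sketch (it assumes the HBase 1.x API and the zk addresses configured earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t_user_info"))) {
            // put 't_user_info','00001','base_info:name','zhangsan'
            Put put = new Put(Bytes.toBytes("00001"));          // row key
            put.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
            put.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
            table.put(put);
        }
    }
}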

6. Query table data

#Query by conditions
get 't_user_info','00001'
get 't_user_info','00001', {COLUMN => 'base_info'}
get 't_user_info','00001', {COLUMN => 'base_info:name'}
get 't_user_info','00001',{TIMERANGE => [1544243300660,1544243362660]}
get 't_user_info','00001',{COLUMN => 'base_info:age',VERSIONS =>3}
#Full table query
scan 't_user_info'
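
A corresponding Java sketch for get and scan, under the same assumptions as the put example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t_user_info"))) {
            // get 't_user_info','00001', {COLUMN => 'base_info:name'}
            Get get = new Get(Bytes.toBytes("00001"));
            get.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("base_info"), Bytes.toBytes("name"))));

            // scan 't_user_info'
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toStringBinary(row.getRow()));
                }
            }
        }
    }
}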

7. Deleting Data

delete 't_user_info','00001','base_info:name'
deleteall 't_user_info','00001'
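
A short Java sketch of the two delete flavors, under the same assumptions as the examples above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t_user_info"))) {
            // delete 't_user_info','00001','base_info:name' (latest version of one cell)
            Delete one = new Delete(Bytes.toBytes("00001"));
            one.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("name"));
            table.delete(one);

            // deleteall 't_user_info','00001' (the whole row)
            table.delete(new Delete(Bytes.toBytes("00001")));
        }
    }
}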

8. Delete table

disable 't_user_info'
drop 't_user_info'

6. The internal principle of hbase

  • All rows in a table are sorted in lexicographic order by row key.
  • A table is divided into multiple HRegions in the row direction.
  • Regions are split by size (10G by default). Each table starts with a single region; as the region grows past the threshold, it is split into two new HRegions. As the number of rows in the table increases, so does the number of HRegions.
  • HRegion is the smallest unit of distributed storage and load balancing in HBase. "Smallest unit" means that different HRegions can be distributed across different HRegionServers.
  • Although HRegion is the smallest unit of load balancing, it is not the smallest unit of physical storage. An HRegion consists of one or more Stores, each Store holding one column family. Each Store is composed of one MemStore and 0 or more StoreFiles. Writes go to the MemStore first; when the data in the MemStore reaches a threshold (128M or 1 hour by default), the HRegionServer flushes it to a StoreFile, and each flush produces a separate StoreFile.
  • When the number of StoreFiles exceeds a threshold (hbase.hstore.blockingStoreFiles, 10 by default), multiple StoreFiles are merged. When the total size of all StoreFiles in a region exceeds hbase.hregion.max.filesize (10G by default), the region is split in two, and the HMaster assigns the new regions to the appropriate region servers to achieve load balancing. (The corresponding configuration properties are sketched below.)
  • Each HRegionServer has an HLog object, a class that implements a Write Ahead Log. Every time data is written to the MemStore, a copy is also written to the HLog file. HLog files are periodically rolled (new files created, and old data that has already been persisted to StoreFiles deleted). When an HRegionServer terminates unexpectedly, the HMaster learns of it through ZooKeeper. The HMaster first processes the leftover HLog files, splitting the log data of different regions and placing it in the corresponding region directories, then reassigns the failed regions. An HRegionServer that receives such a region discovers during region loading that there is historical HLog data to process, replays that data into the MemStore, and then flushes it to StoreFiles to complete data recovery.
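
For reference, the thresholds mentioned above correspond to the following hbase-site.xml properties. This is only a sketch with the defaults cited in this section; exact defaults vary across HBase versions:

    <!-- memstore flush threshold: 128M -->
    <property>
        <name>hbase.hregion.memstore.flush.size</name>
        <value>134217728</value>
    </property>
    <!-- region split threshold: 10G -->
    <property>
        <name>hbase.hregion.max.filesize</name>
        <value>10737418240</value>
    </property>
    <!-- storefile count that triggers merging -->
    <property>
        <name>hbase.hstore.blockingStoreFiles</name>
        <value>10</value>
    </property>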

7. HBase addressing mechanism

Finding RegionServer

  • ZooKeeper -> -ROOT- (single Region) -> .META. -> user table

-ROOT- table

  • The table contains the list of regions where the .META. table is located; the -ROOT- table itself has only one region.
  • The root region is never split, which guarantees that at most three hops are needed to locate any region.
  • ZooKeeper records the location of the -ROOT- table.

.META. table

  • The table contains the list of all user-space regions together with the server addresses of their RegionServers.
  • Each row of the .META. table stores the location information of one region; the row key is composed of the table name plus the region's end row.
  • To speed up access, all regions of the .META. table are kept in memory.

Contact regionserver to query target data

The regionserver locates the region containing the target data and issues the query request to it.

The region searches its memstore first; if the data is found there, it is returned.

If the data is not found in the memstore, the storefiles are scanned (possibly many of them; this is where the bloom filter helps). The bloom filter can quickly report whether the queried rowkey might be in a given storefile, but with some error: if it answers no, the rowkey is definitely not there; if it answers yes, the rowkey may still be absent. A toy sketch of this behavior follows.
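
The following toy Java class (not HBase's actual implementation) illustrates exactly this property: "no" is always correct, while "yes" may be a false positive:

import java.util.BitSet;

public class ToyBloomFilter {
    private static final int SIZE = 1 << 20;   // number of bits in the filter
    private final BitSet bits = new BitSet(SIZE);
    private final int[] seeds = {7, 11, 13};   // one simple hash function per seed

    private int hash(String rowkey, int seed) {
        int h = 0;
        for (char c : rowkey.toCharArray()) h = h * seed + c;
        return (h & 0x7fffffff) % SIZE;
    }

    // Inserting a rowkey sets one bit per hash; collisions cause false positives
    public void add(String rowkey) {
        for (int seed : seeds) bits.set(hash(rowkey, seed));
    }

    // If any bit is unset, the rowkey was definitely never added
    public boolean mightContain(String rowkey) {
        for (int seed : seeds) {
            if (!bits.get(hash(rowkey, seed))) return false; // definitely absent
        }
        return true; // possibly present (may be a false positive)
    }
}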

8. Hbase Advanced Applications

Create a table

BLOOMFILTER defaults to ROW (a row-level Bloom filter)

  • For ROW, a hash of the row key is added to the Bloom filter each time a row is inserted.
  • For ROWCOL, a hash of the row key + column family + column qualifier is added to the Bloom filter each time a row is inserted.

VERSIONS defaults to 1 (one data version)

  • If we do not need to keep so many versions (the data is updated continuously and old versions are of no value to us), setting this parameter to 1 can save 2/3 of the space.

COMPRESSION defaults to NONE (no compression)

  • GZIP / LZO / Zippy / Snappy
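
A sketch of setting these three properties through the Java API (a fragment using the 1.x HColumnDescriptor class; Snappy additionally requires the native libraries to be installed):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

// Configure a column family with explicit values for the properties above
HColumnDescriptor cf = new HColumnDescriptor("base_info");
cf.setBloomFilterType(BloomType.ROWCOL);              // BLOOMFILTER
cf.setMaxVersions(1);                                 // VERSIONS
cf.setCompressionType(Compression.Algorithm.SNAPPY);  // COMPRESSION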

disable_all 'toplist.*': disable_all supports regular expressions and lists the currently matched tables; drop_all works the same way.

hbase table pre-partitioning (manual partitioning)

One way to speed up batch writes is to create some empty regions in advance, so that when data is written to HBase, the load is balanced across the cluster according to the region partitioning. This reduces the time spent on automatic splitting when data reaches the storefile size threshold. It has another advantage as well: with a well-designed rowkey, the concurrent requests to each region can be distributed evenly (or nearly so), maximizing IO efficiency. A pre-splitting sketch follows.
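
A sketch of pre-splitting through the Java Admin API (the table name t_presplit and the split keys are hypothetical; HBase 1.x API assumed). Three split keys produce four empty regions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t_presplit"));
            desc.addFamily(new HColumnDescriptor("f1"));
            // Region boundaries: (-inf,10), [10,20), [20,30), [30,+inf)
            byte[][] splitKeys = {
                Bytes.toBytes("10"), Bytes.toBytes("20"), Bytes.toBytes("30")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}

The same effect can be achieved in the shell with SPLITS, e.g. create 't_presplit', 'f1', SPLITS => ['10', '20', '30'].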

Row key design

Keep the number of column families as small as possible, usually 2-3

rowkey

  • Because of lexicographic ordering, data that needs to be queried in batches should be stored as contiguously as possible (the "spear" side of the contradiction discussed below).
  • Assemble the query condition keywords into the rowkey where possible, placing the most frequently queried conditions toward the front.
  • The rowkey should be as short as possible, ideally not exceeding 16 bytes.

Minimize the size of row keys and column family names: in HBase, a value is always transmitted together with its key.
Every cell in an HFile stores the rowkey, so an overly large rowkey hurts storage efficiency.
The MemStore caches part of the data in memory; if the rowkey field is too long, memory is used less effectively, the system can cache less data, and retrieval efficiency drops.

It is recommended to use the high bits of the rowkey as a hash field, randomly generated by the program, and the low bits as a time field. This increases the probability that data is distributed evenly across RegionServers, achieving load balancing (the "shield" side of the contradiction). A sketch of such a rowkey follows.
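
A sketch of building such a rowkey, with a hypothetical helper that places a 4-byte hash prefix in the high bits and the timestamp in the low bits (12 bytes in total, within the 16-byte recommendation above):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyDemo {
    // Hypothetical helper: high bits = hash field, low bits = time field
    public static byte[] buildRowkey(String userId, long timestamp) throws Exception {
        byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(userId));
        byte[] rowkey = new byte[4 + 8];
        // 4-byte hash prefix spreads rows across RegionServers
        System.arraycopy(hash, 0, rowkey, 0, 4);
        // 8-byte time field keeps per-prefix ordering for range scans
        System.arraycopy(Bytes.toBytes(timestamp), 0, rowkey, 4, 8);
        return rowkey;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Bytes.toStringBinary(buildRowkey("user001", System.currentTimeMillis())));
    }
}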

The rowkey contradiction

  • Rows in HBase are sorted in lexicographic order of the rowkey. This design optimizes scans: related rows, and rows that will be read together, are stored in adjacent locations for easy scanning. However, a poorly designed rowkey is a common source of hotspots.

Hotspot resolution

  • Salting: prepend a random string to the rowkey
  • Hashing: a given row is always salted with the same prefix
  • Reversal: reverse fixed-length or numeric rowkeys, sacrificing rowkey ordering
  • Timestamp reversal

You can append Long.MAX_VALUE - timestamp to the end of the key, for example [key][reverse_timestamp]. The latest value of [key] can then be obtained by scanning [key] and taking the first record, because rowkeys in HBase are ordered and the first record is the most recently written one. A one-line sketch follows.
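
A sketch of this scheme (the key value is hypothetical; Bytes.add concatenates byte arrays):

import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTsDemo {
    public static void main(String[] args) {
        long ts = System.currentTimeMillis();
        // [key][reverse_timestamp]: the newest entry for a key sorts first
        byte[] rowkey = Bytes.add(Bytes.toBytes("user001"), Bytes.toBytes(Long.MAX_VALUE - ts));
        System.out.println(Bytes.toStringBinary(rowkey));
    }
}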

Summary

The above is the full content of this article. I hope it provides some reference value for your study or work.
