MySQL sharding details

1. Business scenario introduction

Suppose there is an e-commerce system that uses MySQL . You need to design a solution that can store large amounts of data, has high concurrency, and is highly scalable. There is a user table in the database. There will be a lot of users and high scalability needs to be achieved. How would you design it? OK, let's first look at the traditional way of sharding

Of course, some friends know how to split the database according to provinces/regions or certain business relationships.

OK, now the question is, how to ensure that data is stored in different tables in different libraries? Let the library reduce concurrency pressure? How should we formulate the rules for sub-library and table division? Don't worry, it's coming soon

2. Horizontal database and table division method

1.RANGE

The first method is to specify a data range to split the table, for example, from 1 to 1000000, 1000001-2000000, using one table for one million, as shown in the following figure

Insert picture description here Of course, this method requires maintaining the table ID, especially in a distributed environment. For this distributed ID, it is recommended to use Redis without using a third-party table sharding tool. Redis 's incr operation can easily maintain the distributed table ID.

Advantages of the RANGE method: Simple expansion, just build the database and table in advance

Disadvantages of the RANGE method: Most reads and writes will access new data, which creates an IO bottleneck. This will cause excessive pressure on the new database and is not recommended.

2. HASH modulus

To solve the IO bottleneck problem of RANGE table partitioning, we can use the method of taking the modulus of user ID HASG to partition the database and table, as shown in the figure:

In this way, the data can be dispersed in different databases and tables, avoiding the problem of IO bottleneck.

Advantages of HASH modulus method: It can ensure that data is evenly distributed in different databases and tables, reducing database pressure

Disadvantages of HASH modulus method: expansion is difficult, and the hash value needs to be recalculated and allocated to different databases and tables each time data is migrated

3. Consistent HASH

Modulo by HASH is not the most perfect way, so what is it?

Using consistent HASH algorithm can perfectly solve the problem

Common HASH algorithm:

A common hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length. This small binary value is called a hash value. A hash value is a unique and extremely compact numeric representation of a piece of data.

The shortcomings of ordinary hash algorithms in distributed applications: In a distributed storage system, data must be stored on specific nodes. If we use ordinary hash algorithms for routing and map data to specific nodes, such as key%n , key is key of the data, and n is the number of machine nodes. If a machine joins or leaves the cluster, all data mappings will be invalid. If it is persistent storage, data migration must be done. If it is a distributed cache, other caches will become invalid.

Consistent HASH algorithm: Use the commonly used hash algorithm to hash the corresponding key into a space with 2^32 nodes, that is, a digital space from 0 to (2^32)-1. Now we can connect these numbers end to end and imagine them as a closed circle, as shown in the figure below.

This ring is connected end to end. Suppose there are three database server nodes node1 , node2 , and node3 . Each node is responsible for storing its own part of the user data. Suppose there are users user1, user2, and user3. We can perform HASH operations on the server nodes. Suppose after HASH calculation, user1 falls on node1 , user2 falls on node2 , and user3 falls on user3.

OK, now let's assume that node3 fails.

user3 will land on node1, and the previous data of node1 and node2 will not change. Suppose node4 is added

You will find that user3 will fall on node4. You will find that through the analysis of node addition and deletion, the consistent hashing algorithm can minimize data migration while maintaining monotonicity. Such an algorithm is very suitable for distributed clusters, avoiding large amounts of data migration and reducing server pressure.

Of course there is still one problem that needs to be solved, and that is balance. From the figure we can see that when there are relatively few server nodes, a problem will arise, that is, a large amount of data will inevitably be concentrated on one node, and very little data will be concentrated on another node.

In order to solve this data skew problem, the consistent hashing algorithm introduces a virtual node mechanism, which calculates multiple hashes for each service node and places a node at each calculation result location, called a virtual node. The specific approach is to first determine the number of virtual nodes associated with each physical node, and then add a number after the IP or host name. For example, in the above case, three virtual nodes can be calculated for each server, so the hash values of " node 1-1 ", " node 1-2 ", " node 1-3 node 2-1 ", " node 2-2 ", " node 2-3 ", " node 3-1 ", " node 3-2 ", and " node 3-3 " can be calculated respectively, thus forming nine virtual nodes.

For example, user1 is located on node 1-1 , node 1-2 , and node 1-3 , which are actually all located on node1 This can solve the problem of data skew when there are few service nodes. Of course, the number of virtual nodes is not fixed at three or at most or at least three. This is just an example. The specific number of virtual nodes needs to be determined according to the actual business situation.

Advantages of the consistent HASH method: virtual nodes can ensure that data is evenly distributed in different databases and tables, and adding or deleting nodes does not affect the data of other nodes, with high availability and strong disaster tolerance.

Disadvantages of the consistent modulus method: Well, compared with the above two, it can be considered that there are none.

3. Unit Testing

OK, no more nonsense, next is the unit test, assuming there are three nodes, each node has three virtual nodes

package com.hyh.core.test;

import com.hyh.utils.common.StringUtils;
import org.junit.Test;

import java.util.LinkedList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Consistency HASH TEST
 *
 * @Author heyuhua
 * @create 2021/1/31 19:50
 */
public class ConsistentHashTest {

    //List of servers to be added to the Hash ring private static String[] servers = {"192.168.5.1", "192.168.5.2", "192.168.5.3"};

    //Real node list. Considering the scenarios where the server goes online and offline, that is, the addition and deletion scenarios will be more frequent, it is better to use LinkedList here private static List<String> realNodes = new LinkedList<>();

    //Virtual node, key represents the hash value of the virtual node, value represents the name of the virtual node private static SortedMap<Integer, String> virtualNodes = new TreeMap<>();

    //One real node corresponds to 3 virtual nodes private static final int VIRTUAL_NODES = 3;

    /**
     * Test the consistency of HASH with virtual nodes
     */
    @Test
    public void testConsistentHash() {
        initNodes();
        String[] users = {"user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9"};
        for (int i = 0; i < users.length; i++)
            System.out.println("[" + users[i] + "]'s hash value is " +
                    getHash(users[i]) + ", routed to node[" + getServer(users[i]) + "]");
    }

    /**
     * First add the original server to the real node list */
    public void initNodes() {
        for (int i = 0; i < servers.length; i++)
            realNodes.add(servers[i]);
        for (String str : realNodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                String virtualNodeName = str + "-virtual node" + String.valueOf(i);
                int hash = getHash(virtualNodeName);
                System.out.println("Virtual node [" + virtualNodeName + "] is added, hash value is " + hash);
                virtualNodes.put(hash, virtualNodeName);
            }
        }
        System.out.println();
    }

    //Use FNV1_32_HASH algorithm to calculate the Hash value of the server. The method of rewriting hashCode is not used here, and the final effect is no different private static int getHash(String str) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++)
            hash = (hash ^ str.charAt(i)) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        // If the calculated value is negative, take its absolute value if (hash < 0)
            hash = Math.abs(hash);
        return hash;
    }

    //Get the node to be routed to private static String getServer(String key) {
        //Get the hash value of the key int hash = getHash(key);
        // Get all Maps greater than the Hash value
        SortedMap<Integer, String> subMap = virtualNodes.tailMap(hash);
        String virtualNode;
        if (subMap.isEmpty()) {
            //If there is no hash value greater than the key, start from the first node Integer i = virtualNodes.firstKey();
            //Return the corresponding server virtualNode = virtualNodes.get(i);
        } else {
            //The first Key is the node closest to the node in clockwise direction Integer i = subMap.firstKey();
            //Return the corresponding server virtualNode = subMap.get(i);
        }
        //The virtualNode virtual node name needs to be intercepted if (StringUtils.isNotBlank(virtualNode)) {
            return virtualNode.substring(0, virtualNode.indexOf("-"));
        }
        return null;
    }
}

Here we simulate the situation where 9 user objects are routed after being hashed. See the results

Summarize:

It is strongly recommended to use consistent HASH algorithm for sharding in distributed microservice architecture environment. Of course, business data consistency and distributed transaction problems will also arise in distributed environment. In the next issue, we will discuss solutions for data consistency and distributed transactions.

This is the end of this article about the details of MySQL sharding. For more information about MySQL sharding, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future!

You may also be interested in:

Getting Started Guide to MySQL Sharding
A brief discussion on order reconstruction: MySQL sharding
Summary of MySQL's commonly used database and table sharding solutions
Mysql database sharding and table sharding completely collapsed
Several methods of primary key processing after Mysql database and table sharding
SpringBoot+MybatisPlus+Mysql+Sharding-JDBC sharding
Several ways to shard MySQL databases and tables

<<: JavaScript basics for loop and array

>>: Differences between FLOW CHART and UI FLOW

What does it mean to add a question mark (?) after the CSS link address?

MySQL sharding details

1. Business scenario introduction

2. Horizontal database and table division method

1.RANGE

2. HASH modulus

3. Consistent HASH

3. Unit Testing

What does it mean to add a question mark (?) after the CSS link address?

HTML left, center, right adaptive layout (using calc css expression)

Abbreviation of HTML DOCTYPE

Detailed explanation of how to solve the problem that the docker container cannot access the host machine through IP

How to remove carriage return characters from text in Linux

Implementation of crawler Scrapy image created by dockerfile based on alpine

A brief introduction to MySQL storage engine

MySQL FAQ series: When to use temporary tables

Vue's simple implementation process of imitating NetEase Cloud Music Player interface

Do you know how to use the flash wmode attribute in web pages?

Recommend

vue+elementui implements the complete code of adding and modifying a shared bullet box

CSS to achieve single-select folding menu function

Detailed explanation of command to view log files in Linux environment

WeChat applet example of using functions directly in {{ }}

The process of installing Docker on Windows Server 2016 and the problems encountered

Analysis of the locking mechanism of MySQL database

CSS3 sets a mask for the background image and solves the problem of mask style inheritance

How to view MySQL links and kill abnormal links

JavaScript two pictures to understand the prototype chain

Detailed explanation of the implementation steps of MySQL dual-machine hot standby and load balancing

Summary of 16 XHTML1.0 and HTML Compatibility Guidelines

Solution to HTML2 canvas SVG not being recognized

JavaScript custom plug-in to implement tab switching function

Analyze the problem of Zabbix 5.0 disk automatic discovery and read-write monitoring

Detailed code for building a multi-person video chat service based on webrtc on Ubuntu