MySQL chooses the appropriate data type for id

MySQL chooses the appropriate data type for id

When choosing the data type of id, you need to consider not only the data storage type, but also how MySQL calculates and compares this type. For example, MySQL stores ENUM and SET types as integers internally, but compares them as strings in string scenarios. Once the data type of the id is selected, you need to ensure that the data types of the related data tables that reference the id are consistent and completely consistent, including properties such as length and whether it is signed! Mixing different data types can cause performance problems, and even if there are no performance problems, implicit data conversions during comparisons can cause subtle errors. However, if you forget about the different data types during the actual development process, unexpected problems may suddenly arise.

When choosing the length, you also need to choose a small field length as much as possible and leave some room for future growth. For example, if it is used to store provinces, we only have dozens of values. At this time, it is better to use TINYINT rather than INT. If the related table also stores this id, the efficiency difference will be very large.

Here are some typical types that apply to ids:

  • Integer: Integer is usually the best choice, because integer operations and comparisons are fast, and you can set the AUTO_INCREMENT attribute to automatically increment.
  • ENUM and SET: Enumerations and sets are usually not chosen as IDs, but they are quite suitable for columns that contain "type", "status", and "gender". For example, when we need a table to store a drop-down menu, there is usually a value and a name. In this case, it is also possible to use an enumeration as the primary key.
  • Strings: Avoid using strings as IDs as much as possible. First, strings take up more space, and second, they are usually slower than integers. When choosing a string as an id, you also need to pay special attention to functions such as MD5, SHA1, and UUID. Each value is a random value in a large range, with no order, which makes insertion and querying slower:
    • During insertion, since the index is created at a random location (causing paging, random disk access, and clustered index fragmentation), the insertion speed will be reduced.
    • When querying, adjacent data rows may span a large distance on disk or in memory, which can also result in slower performance.

If you do want to use a UUID value, you should remove the "-" character, or use the UNHEX function to convert it to a 16-byte number and store it using BINARY(16). Then you can use the HEX function to get it in hexadecimal format. There are many ways to generate UUID, some are randomly distributed, and some are ordered, but even the ordered ones do not perform as well as integers.

Summary of Distributed ID Solutions

ID is the unique identifier of data. The traditional approach is to use UUID and the database's auto-increment ID. Today, MySQL is used more and more widely, and because it requires transaction support, the Innodb storage engine is usually used. UUID is too long and disordered, so it is not suitable as the primary key in Innodb. Auto-increment ID is more appropriate, but with business development, the amount of data will become larger and larger, and the data needs to be divided into tables. After the tables are divided, the data in each table will be auto-incremented at its own pace, and there is a high probability of ID conflicts. At this time, a separate mechanism is needed to generate a unique ID. The generated ID can also be called a distributed ID or a global ID. Let's analyze the mechanisms for generating distributed IDs.

Database auto-increment ID

This method is based on the database's auto-increment ID. You need to use a separate database instance and create a separate table in this instance:

The table structure is as follows:

CREATE DATABASE `SEQID`;

CREATE TABLE SEQID.SEQUENCE_ID (
	id bigint(20) unsigned NOT NULL auto_increment, 
	stub char(10) NOT NULL default '',
	PRIMARY KEY (id),
	UNIQUE KEY stub (stub)
)ENGINE=MyISAM;

You can use the following statement to generate and obtain an auto-increment ID

begin;
replace into SEQUENCE_ID (stub) VALUES ('anyword');
select last_insert_id();
commit;

The stub field does not have any special meaning here. It is just for the convenience of inserting data. Only when data can be inserted can an auto-increment id be generated. For insertion, we use replace. Replace will first check whether there is data with the same value as the stub specified. If it exists, it will delete it first and then insert it. If it does not exist, it will directly insert it.

This mechanism for generating distributed IDs requires a separate MySQL instance. Although it is feasible, it is not enough in terms of performance and reliability. Every time a business system needs an ID, it needs to request the database to obtain it, which has low performance. If this database instance goes offline, all business systems will be affected. ; Therefore, the data in this way has a certain degree of unreliability.

Database multi-master mode

If our two databases form a master-slave cluster, the database reliability problem can be solved under normal circumstances. However, if the master database fails and the data is not synchronized to the slave database in time, ID duplication will occur. This is why we can use a multi-master mode☞dual-master mode cluster, that is, both MySQL instances can generate auto-increment IDs independently, which can improve efficiency. However, if no other modifications are made, the two MySQL instances are likely to generate the same ID. You need to configure different starting values ​​and auto-increment steps for each MySQL instance.

First MySQL instance configuration (mysql_01):

set @@auto_increment_offset = 1; -- starting value set @@auto_increment_increment = 2; -- step length

Second MySQL instance configuration (mysql_02):

set @@auto_increment_offset = 2; -- starting value set @@auto_increment_increment = 2; -- step length

After the above configuration, the id sequences generated by the two MySQL instances are as follows:
mysql_01: The starting value is 1, the step length is 2, and the sequence of ID generation is: 1, 3, 5, 7, 9, ...
mysql_02:, the starting value is 2, the step length is 2, and the sequence of ID generation is: 2, 4, 6, 8, 10, ...

For this solution of generating distributed IDs, you need to add a new distributed ID generation application, such as DistributIdService. This application provides an interface for business applications to obtain IDs. When a business application needs an ID, it requests DistributIdService through RPC. DistributIdService randomly obtains the ID from the two MySQL instances above.

After implementing this solution, even if one of the MySQL instances goes offline, it will not affect DistributIdService, and DistributIdService can still use another MySQL to generate IDs.

However, this solution does not scale well. If two MySQL instances are not enough and you need to add a new MySQL instance to improve performance, it will be troublesome.

Now if you want to add a new instance mysql_03, how do you do it?

  • First, the step size of mysql_01 and mysql_02 must be changed to 3, and it can only be modified manually, which takes time.
  • Second, because mysql_01 and mysql_02 are constantly increasing, we may need to set a larger starting value for mysql_03 to allow sufficient time to modify the step size of mysql_01 and mysql_02.
  • Third, duplicate IDs are likely to occur when modifying the step size. To solve this problem, you may need to shut down the machine.

Number segment mode

This mode can be understood as batch acquisition. For example, when DistributIdService obtains IDs from the database, if multiple IDs can be obtained in batches and cached locally, it will greatly improve the efficiency of business applications in obtaining IDs.

For example, each time DistributIdService obtains an ID from the database, it obtains a number segment, such as (1,1000]. This range represents 1000 IDs. When the business application requests DistributIdService to provide an ID, DistributIdService only needs to increment the number locally from 1 and return it, without requesting the database every time. It does not go to the database to obtain the next number segment until the local number is incremented to 1000, that is, when the current number segment has been used up.

Therefore, we need to modify the database table as follows:

CREATE TABLE id_generator (
  id int(10) NOT NULL,
  current_max_id bigint(20) NOT NULL COMMENT 'Current maximum id',
  increment_step int(10) NOT NULL COMMENT 'Increment step',
  PRIMARY KEY (`id`)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

This database table is used to record the auto-increment step and the current maximum value of the auto-increment ID (that is, the last value of the currently applied number segment). Because the auto-increment logic has been moved to DistributIdService, the database no longer needs this logic.

This solution is no longer strongly dependent on the database. Even if the database is unavailable, DistributIdService can continue to support it for a period of time. However, if the DistributIdService is restarted, a segment of IDs will be lost, resulting in ID holes.

In order to improve the high availability of DistributIdService, a cluster is required. When a business requests the DistributIdService cluster to obtain an ID, it will randomly select a DistributIdService node to obtain it. For each DistributIdService node, the database is connected to the same database, so multiple DistributIdService nodes may request the database to obtain the number segment at the same time. In this case, optimistic locking is required for control. For example, a version field is added to the database table. When obtaining the number segment, the following SQL is used:

update id_generator set current_max_id=#{newMaxId}, version=version+1 where version = #{version}

Because newMaxId is calculated in DistributIdService based on oldMaxId + step size, as long as the above update is successful, it means that the number segment is obtained successfully.

In order to provide high availability of the database layer, the database needs to be deployed in multi-master mode. For each database, it is necessary to ensure that the generated number segments are not repeated. This requires using the original idea and adding the starting value and step size to the database table. For example, if there are two MySQL servers, then:
mysql_01 will generate a number segment (1,1001], and the sequence when it is incremented is 1, 3, 4, 5, 7...
mysql_02 will generate a number segment (2,1002], and the sequence when it is incremented is 2, 4, 6, 8, 10...

For specific implementation code, please refer to: tinyid

Snowflake Algorithm

The three modes of database self-increment ID mode, database multi-master mode, and number segment mode are all based on the idea of ​​self-increment; the idea of ​​the snowflake algorithm can be briefly understood below.
Snowflake is Twitter's open source distributed ID generation algorithm. It is an algorithm, so it is different from the above three distributed ID generation mechanisms. It does not rely on a database.

The core idea is that the distributed ID is a fixed long number. A long number occupies 8 bytes, that is, 64 bits. The allocation of bits in the original Snowflake algorithm is as follows:

  • The first bit is the identification part. In Java, since the highest bit of long is the sign bit, positive numbers are 0 and negative numbers are 1. Generally, the generated ID is a positive number, so it is fixed to 0.
  • The timestamp part occupies 41 bits, which is the millisecond time. Generally, the current timestamp is not stored, but the difference of the timestamp (current time - fixed start time), so that the generated ID can start from a smaller value; a 41-bit timestamp can be used for 69 years, (1L << 41) / (1000L * 60 * 60 * 24 * 365) = 69 years
  • The working machine ID occupies 10 bits, which is relatively flexible. For example, the first 5 bits can be used as the data center computer room identifier, and the last 5 bits can be used as the single computer room machine identifier. 1024 nodes can be deployed.
  • The serial number part occupies 12 bits, and supports the same node to generate 4096 IDs in the same millisecond.

According to the logic of this algorithm, we only need to implement this algorithm in Java language and encapsulate it into a tool method. Then each business application can directly use this tool method to obtain the distributed ID. We only need to ensure that each business application has its own working machine ID, without having to build a separate application to obtain the distributed ID. It also does not rely on a database.

Specific code implementation

package com.yeming.tinyid.application;

import static java.lang.System.*;

/**
 * @author yeming.gao
 * @Description: Snowflake algorithm implementation* <p>
 * The SnowFlake algorithm is used to generate 64-bit IDs, which can be stored in long integers and can be used to generate unique IDs in distributed systems.
 * And the generated IDs have a rough order. In this implementation, the generated 64-bit ID can be divided into 5 parts:
 * 0 - 41-bit timestamp - 5-bit data center ID - 5-bit machine ID - 12-bit serial number * @date 2020/07/28 16:15
 */
public class SnowFlake {
    /**
     * Starting timestamp */
    private static final long START_STMP = 1480166465631L;

    /**
     * Number of digits occupied by the machine ID */
    private static final long MACHINE_BIT = 5;
    /**
     * Number of bits occupied by the data center */
    private static final long DATACENTER_BIT = 5;
    /**
     * Number of digits occupied by the serial number */
    private static final long SEQUENCE_BIT = 12;

    /**
     * Maximum value of machine ID*/
    private static final long MAX_MACHINE_NUM = ~(-1L << MACHINE_BIT);
    /**
     * Data center maximum value */
    private static final long MAX_DATACENTER_NUM = ~(-1L << DATACENTER_BIT);
    /**
     * Maximum serial number value*/
    private static final long MAX_SEQUENCE = ~(-1L << SEQUENCE_BIT);
    /**
     * The displacement of each part to the left*/
    private static final long MACHINE_LEFT = SEQUENCE_BIT;
    private static final long DATACENTER_LEFT = SEQUENCE_BIT + MACHINE_BIT;
    private static final long TIMESTMP_LEFT = DATACENTER_LEFT + DATACENTER_BIT;

    private long datacenterId; //Data centerprivate long machineId; //Machine IDprivate long sequence = 0L; //Serial numberprivate long lastStmp = -1L; //Last timestampprivate SnowFlake(long datacenterId, long machineId) {
        if (datacenterId > MAX_DATACENTER_NUM || datacenterId < 0) {
            throw new IllegalArgumentException("datacenterId can't be greater than MAX_DATACENTER_NUM or less than 0");
        }
        if (machineId > MAX_MACHINE_NUM || machineId < 0) {
            throw new IllegalArgumentException("machineId can't be greater than MAX_MACHINE_NUM or less than 0");
        }
        this.datacenterId = datacenterId;
        this.machineId = machineId;
    }

    /**
     * Generate the next ID
     *
     * @return long
     */
    private synchronized long nextId() {
        long currStmp = System.currentTimeMillis();
        if (currStmp < lastStmp) {
            throw new RuntimeException("Clock moved backwards. Refusing to generate id");
        }
        if (currStmp == lastStmp) {
            //In the same millisecond, the sequence number increases automatically sequence = (sequence + 1) & MAX_SEQUENCE;
            //The number of sequences in the same millisecond has reached the maximum if (sequence == 0L) {
                currStmp = getNextMill();
            }
        } else {
            //In different milliseconds, the sequence number is set to 0
            sequence = 0L;
        }
        lastStmp = currStmp;
        return (currStmp - START_STMP) << TIMESTMP_LEFT //Timestamp part | datacenterId << DATACENTER_LEFT //Data center part | machineId << MACHINE_LEFT //Machine ID part | sequence; //Serial number part}

    private long getNextMill() {
        long mill = System.currentTimeMillis();
        while (mill <= lastStmp) {
            mill = System.currentTimeMillis();
        }
        return mill;
    }

    public static void main(String[] args) {
        SnowFlake snowFlake = new SnowFlake(2, 3);
        // Maximum value of data center identifier long maxDatacenterNum = ~(-1L << DATACENTER_BIT);
        //Maximum value of machine identifier long maxMachineNum = ~(-1L << MACHINE_BIT);
        // Maximum value of sequence number long maxSequence = ~(-1L << SEQUENCE_BIT);
        out.println("Maximum value of data center ID: " + maxDatacenterNum + "; Maximum value of machine ID: " + maxMachineNum + "; Maximum value of sequence number: " + maxSequence);
        for (int i = 0; i < (1 << 12); i++) {
            out.println(snowFlake.nextId());
        }
    }
}

The snowflake algorithm can refer to:

  • Baidu (uid-generator)
  • Meituan (Leaf)

The above is the details of how to choose the appropriate data type for MySQL id. For more information about choosing the appropriate data type for MySQL id, please pay attention to other related articles on 123WORDPRESS.COM!

You may also be interested in:
  • Detailed explanation of MySql data type tutorial examples
  • MySQL 5.7 Common Data Types
  • MySQL data type selection principles
  • Implementation of mysql decimal data type conversion
  • Implementation of mysql data type conversion
  • Detailed explanation of the usage of MySQL data type DECIMAL
  • Detailed explanation of the decimal padding problem of decimal data type in MySQL
  • MySQL data type details

<<:  3 different ways to clear the option options in the select tag

>>:  Corporate website and column dictionary Chinese and English comparison Friends who make Chinese and English bilingual corporate websites need

Recommend

Deeply understand how nginx achieves high performance and scalability

The overall architecture of NGINX is characterize...

Docker Stack deployment method steps for web cluster

Docker is becoming more and more mature and its f...

JavaScript implements H5 gold coin function (example code)

Today I made a Spring Festival gold coin red enve...

Detailed explanation of several error handling when Nginx fails to start

When using Nginx as a Web server, I encountered t...

Detailed explanation of the usage of DECIMAL in MySQL data type

Detailed explanation of the usage of DECIMAL in M...

How to allow external network access to mysql and modify mysql account password

The root account of mysql, I usually use localhos...

A line of CSS code that crashes Chrome

General CSS code will only cause minor issues wit...

How to build a redis cluster using docker

Table of contents 1. Create a redis docker base i...

Solution to the problem of saving format in HTML TextArea

The format of textarea can be saved to the databas...

ie filter collection

IE gave us a headache in the early stages of deve...