Getting Started Guide to MySQL Sharding

Getting Started Guide to MySQL Sharding

Preface

Relational databases are more likely to become system bottlenecks. The storage capacity, number of connections, and processing power of a single machine are limited. When the amount of data and concurrency increase, the database must be segmented.

The means of data sharding is to divide the database and table into different parts. There are two aspects to sharding libraries and tables. It may be sharding libraries only without sharding tables, or sharding tables only without sharding libraries.

The core content of database distribution is nothing more than data segmentation, as well as the positioning and integration of data after segmentation.

Why do we need to divide the database into different tables?

Sub-table

When the amount of data in a single table is too large, it will seriously affect the performance of SQL execution. Generally, when a single table reaches several million, the performance will be relatively poor, and then it is necessary to divide the table into sub-tables.

Splitting tables means putting the data of one table into multiple tables, and then querying only one table. For example, divide the table by project ID: put a fixed number of project data in one table, so that the amount of data in each table can be controlled within a controllable range.

Sub-library

According to experience, a database needs to be expanded when it supports a maximum of 2000 concurrent requests, and the concurrency value of a healthy single database is best maintained at around 1000. Then you can split the data of one library into multiple libraries, and only need to access one library when accessing.

This is what is called sharding the database and tables. Why do we need to shard the database and tables?

  • Improve concurrent support capabilities
  • Reduce disk usage
  • Improving SQL execution performance

How to divide the database and table

Directly look at the picture:

For vertical splitting, it is recommended to design the table well at the beginning of system design to avoid vertical splitting of the table.

Horizontal splitting can be done by range or by a certain field hash. The advantage of dividing by range is that capacity expansion is simple, and you only need to prepare a new table or database. However, it is easy to cause hot issues, so it should be considered in combination with business scenarios when actually used. The advantage of dividing by hash is that the request pressure of each database or table can be evenly distributed. The disadvantage is that capacity expansion is difficult, and the previous data needs to be rehashed, which involves a data migration process.

Problems caused by sharding

Sub-database and sub-table can effectively alleviate the pressure on network IO, hardware resources, and number of connections brought by a single machine and a single database. But it also brought some problems.

  • Transaction consistency issues are solved through distributed transactions or ensuring eventual consistency.
  • Cross-node association query join problem global table, field redundancy, data assembly, ER fragmentation
  • Cross-node paging, sorting, and aggregation function problems First, query on different shard nodes, and finally summarize or merge the results
  • Global primary key duplicate avoidance problem Various distributed ID generation algorithms
  • If the data migration and expansion problem is range sharding, you only need to add nodes to expand the capacity.
    If it is hash, the general practice is to read out the historical data first, and then write the data to each shard node according to the specified sharding rules.

Data Migration

Two data migration solutions are introduced.

The lowest-level solution is to shut down the system for a while, use a pre-written data import tool to extract the data from a single table, and write it to the sub-library and sub-table.

The second solution sounds more reliable, the dual-write migration solution. In the online system, all the places where data was previously written, the addition, deletion and modification operations, in addition to the addition, deletion and modification of the old database, are added to the addition, deletion and modification of the new database. This is the so-called double write. After the system is deployed, run the data import tool in Solution 1 to read the old database and write the new database. When writing, the last modification time of the data should be determined based on fields such as gmt_modified. The data will only be written if the new database does not have the data or the data is newer than the new database data. Simply put, it is not allowed to overwrite new data with old data.

After one round of writing, there may still be inconsistencies. In this case, the program will automatically perform a new round of verification, comparing each piece of data in each table of the new and old databases. If there are any differences, the program will read the data from the old database and write it again. Repeat the cycle until the data is completely consistent.

middleware

The more common middleware for sub-library and sub-table are:

  • Cobar: It was developed and open-sourced by Alibaba's b2b team. It is a proxy layer solution that is between the application server and the database server. The application accesses the Cobar cluster through the JDBC driver. Cobar decomposes the SQL according to the SQL and database partitioning rules, and then distributes it to different database instances in the MySQL cluster for execution. Operations such as read-write separation, stored procedures, cross-database joins, and paging are not supported. It hasn't been updated in recent years and not many people use it.
  • TDDL: Developed by the Taobao team, it is a client-layer solution. It supports basic CRUD syntax and read-write separation, but does not support syntax such as join and multi-table query. It is not used much at present because it still relies on Taobao's diamond configuration management system.
  • Atlas: 360 open source, belongs to the proxy layer solution. It has not been maintained for several years, and there are very few companies using it now.
  • Sharding-jdbc: Dangdang’s open source, client-layer solution, has now been renamed ShardingSphere. It supports a wide range of SQL syntaxes without too many restrictions, including sharding of databases and tables, read-write separation, distributed ID generation, and flexible transactions (best effort delivery transactions, TCC transactions). Moreover, many companies use it and the community is active.
  • Mycat: Based on Cobar, it belongs to the proxy layer solution. The supported functions are very complete. Compared to Sharding-jdbc, it is younger.

In summary, Sharding-jdbc and Mycat are the options that can be considered.

The advantages of the client-layer solution such as Sharding-jdbc are that it does not require deployment, has low operation and maintenance costs, does not require secondary forwarding at the proxy layer, and has high performance. The disadvantage is coupling.
The disadvantage of Mycat's proxy layer solution is that it requires deployment and maintenance of a set of middleware by yourself, which has high operation and maintenance costs, but the advantage is that it is transparent to the project.

MySQL partitioning (deprecated)

Partitioning is introduced here mainly to avoid confusion with concepts such as splitting, database and table sharding, etc.
MySQL supports the partition function since version 5.1. Partitioning means that the database breaks down a table into multiple smaller and more manageable parts according to certain rules. As far as applications accessing the database are concerned, there is logically only one table or one index, but in fact this table may be composed of multiple physical partitions, which is transparent to the application.
MySQL partitioning introduces the concept of partition key and adopts a divide-and-conquer approach, which is conducive to managing very large tables. The partition key is used to aggregate data based on a certain interval value, a specific value list, or a HASH function, so that data is distributed in different partitions according to rules. There are six main partition types available in MySQL 5.7:

  • RANGE partitioning: Allocate data to different partitions based on a given continuous interval range.
  • LIST partitioning: Similar to RANGE partitioning, the difference is that LIST partitioning is based on the enumerated value list partitioning, while RANGE is based on a given continuous interval range partitioning.
  • COLUMNS partitioning: Similar to RANGE and LIST, the difference is that the partition key can be multiple columns or non-integer.
  • HASH partitioning: Based on the given number of partitions, data is distributed modulo different partitions.
  • KEY partitioning: Similar to HASH partitioning, but uses the hash function provided by MySQL.
  • Subpartition: also called composite partition or combined partition, that is, make another partition under the primary partition to divide the data again.

Here is an example of a LIST partition:

CREATE TABLE orders_list (
  id INT AUTO_INCREMENT,
  customer_surname VARCHAR(30),
  store_id INT,
  salesperson_id INT,
  order_date DATE,
  note VARCHAR(500),
  INDEX idx (id)
) ENGINE = INNODB
  PARTITION BY LIST(store_id) (
  PARTITION p1
  VALUES IN (1, 3, 4, 17)
  INDEX DIRECTORY = '/var/orders/district1'
  DATA DIRECTORY = '/var/orders/district1',
  PARTITION p2
  VALUES IN (2, 12, 14)
  INDEX DIRECTORY = '/var/orders/district2'
  DATA DIRECTORY = '/var/orders/district2',
  PARTITION p3
  VALUES IN (6, 8, 20)
  INDEX DIRECTORY = '/var/orders/district3'
  DATA DIRECTORY = '/var/orders/district3',
  PARTITION p4
  VALUES IN (5, 7, 9, 11, 16)
  INDEX DIRECTORY = '/var/orders/district4'
  DATA DIRECTORY = '/var/orders/district4',
  PARTITION p5
  VALUES IN (10, 13, 15, 18)
  INDEX DIRECTORY = '/var/orders/district5'
  DATA DIRECTORY = '/var/orders/district5'
);

Advantages of partitioning:

  • Expand storage capacity.
  • Optimize queries. When the WHERE clause contains partition conditions, only necessary partitions can be scanned to improve query efficiency; when queries involving aggregate functions such as SUM() and COUNT() are performed, they can be processed in parallel on each partition.
  • For data partitions that have expired or do not need to be saved, you can quickly delete the data by deleting the partition.
  • Distribute query data across multiple disks to achieve greater query throughput.

Summarize

This is the end of this article about MySQL sharding. For more information about MySQL sharding, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • A brief discussion on order reconstruction: MySQL sharding
  • MySQL sharding details
  • Summary of MySQL's commonly used database and table sharding solutions
  • Mysql database sharding and table sharding completely collapsed
  • Several methods of primary key processing after Mysql database and table sharding
  • SpringBoot+MybatisPlus+Mysql+Sharding-JDBC sharding
  • Several ways to shard MySQL databases and tables

<<:  Zen coding for editplus example code description

>>:  js implements custom drop-down box

Recommend

MYSQL METADATA LOCK (MDL LOCK) theory and lock type test

Table of contents MYSQL METADATA LOCK (MDL LOCK) ...

Example of Form action and onSubmit

First: action is an attribute of form. HTML5 has d...

Vue uses canvas handwriting input to recognize Chinese

Effect picture: Preface: Recently, I was working ...

Mini Program implements list countdown function

This article example shares the specific code for...

JavaScript programming through Matlab centroid algorithm positioning learning

Table of contents Matlab Centroid Algorithm As a ...

Why does MySQL paging become slower and slower when using limit?

Table of contents 1. Test experiment 2. Performan...

Summary of important mysql log files

Author: Ding Yi Source: https://chengxuzhixin.com...

Node.js+express message board function implementation example

Table of contents Message Board Required librarie...

HTML form and the use of form internal tags

Copy code The code is as follows: <html> &l...