Introduction

With the widespread adoption of Internet applications, the storage and access of massive data has become a bottleneck in system design. For a large-scale Internet application, billions of page views per day place a very high load on the database, creating serious problems for system stability and scalability. Improving site performance by segmenting data and scaling the data layer horizontally has become the preferred approach for architects.
Why segment data?

The above briefly describes what data segmentation is. Readers may wonder: why is data segmentation necessary? Isn't a mature, stable database like Oracle sufficient to store and query massive data? Oracle is indeed very mature and stable, but its high license fees and the high-end hardware it requires are not affordable for every company. Imagine annual license fees and minicomputers each costing tens of millions of yuan: can an average company afford this? And even if it can, why not choose a cheaper solution with better horizontal scalability, if one exists?

We also know that no matter how well configured a machine is, it has physical limits. When an application reaches or far exceeds the limit of a single machine, we must either keep upgrading the hardware or enlist more machines; the common solution is horizontal expansion, adding machines to share the load. We also have to consider whether, as the business logic continues to grow, the system can keep up through linear growth alone. Sharding distributes computation, storage, and I/O across multiple machines in parallel, making full use of their combined processing capacity while avoiding single points of failure, improving system availability, and providing good fault isolation.

Considering these factors, data segmentation is necessary. By building a cluster from free MySQL and cheap servers, or even PCs, we can achieve the effect of a minicomputer running a large commercial database while greatly reducing capital investment and operating costs. Why not do it?
In large and medium-sized projects, when designing the database, the database or data tables are usually split horizontally to keep the load on any single database or table within limits. Here we introduce two table-segmentation methods commonly used in projects. Both rely on application-level logic to route each query to the correct table.

First, we need to decide which field to base the horizontal split on. In our system (an SNS), the user's UID runs through the entire system and is unique and auto-incrementing, so it is the natural field to partition by.

Method 1: MD5 hashing

Hash the UID with md5(), then take the first few hex digits (here, the first two), so that different UIDs are hashed into different user tables (user_xx):

```php
function getTable( $uid ) {
    $ext = substr( md5( $uid ), 0, 2 );
    return "user_" . $ext;
}
```

With this trick we can distribute UIDs across 256 user tables: user_00, user_01 ... user_ff. Because the UID is a monotonically increasing number, the md5 hash spreads user data almost evenly across the tables. But there is a problem: as the system gains more and more users, the data volume of each table inevitably grows, and this scheme cannot be expanded beyond its fixed 256 tables, which brings us back to the problem at the beginning of the article.

Method 2: Bit shifting

The specific method is:

```php
function getTable( $uid ) {
    return "user_" . sprintf( "%04d", $uid >> 20 );
}
```

Here we shift the UID 20 bits to the right, so the first 2^20 (roughly one million) users go into the first table, user_0000, the next million into the second table, user_0001, and so on. If the user base keeps growing, we simply add more user tables.
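To make the routing concrete, here is a minimal runnable sketch of both schemes. The function names (getTableByHash, getTableByShift) and the sample UIDs are illustrative choices, not part of the original article:

```php
<?php
// MD5-based routing: the first two hex digits of md5($uid)
// select one of 256 tables, user_00 ... user_ff.
function getTableByHash( $uid ) {
    return "user_" . substr( md5( $uid ), 0, 2 );
}

// Shift-based routing: every block of 2^20 (~1 million)
// consecutive UIDs shares one table, user_0000, user_0001, ...
function getTableByShift( $uid ) {
    return "user_" . sprintf( "%04d", $uid >> 20 );
}

echo getTableByHash( 1 ), "\n";         // user_c4
echo getTableByHash( 2 ), "\n";         // user_c8
echo getTableByShift( 1 ), "\n";        // user_0000
echo getTableByShift( 12345678 ), "\n"; // user_0011
```

Note how consecutive UIDs land in scattered tables under the hash scheme but fill tables sequentially under the shift scheme; this is exactly why the shift scheme can grow by adding tables while the hash scheme cannot.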
Since the table suffix we reserve is four digits, we can create up to 10,000 user tables: user_0000, user_0001 ... user_9999. With 10,000 tables and 1 million rows in each, we can store 10 billion user records. If your user data exceeds even that, it doesn't matter: just widen the reserved suffix. For example, for 100 billion rows at 1 million rows per table, you need 100,000 tables, so the suffix must be six digits.

The algorithm above can be written more flexibly:

```php
/**
 * UID-based partitioning algorithm
 * @param int $uid   User ID
 * @param int $bit   Number of digits reserved in the table suffix
 * @param int $seed  Number of bits to shift right
 */
function getTable( $uid, $bit, $seed ) {
    return "user_" . sprintf( "%0{$bit}d", $uid >> $seed );
}
```

Summary

Both methods require us to estimate, as accurately as possible, the maximum amount of user data the system will hold and the maximum load a single table can handle. In the second scheme, for example, if we estimate 10 billion users and an optimal single-table size of 1 million rows, we shift the UID right by 20 bits so that each table holds about 1 million rows, and reserve four digits in the table name (user_xxxx) to allow for 10,000 tables. In the first scheme, if each table holds 1 million rows and we take the first two hex digits of the MD5 hash, there are only 256 tables, so the system can hold 256 × 1 million rows in total; if your total data volume is larger, you must take the first three, four, or even more digits of the hash.

Both methods split the data horizontally across different tables. Compared with the first method, the second is more scalable.
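As a self-contained sketch, the flexible routing function can be exercised with the parameter choices discussed above (the sample UID is arbitrary and chosen only for illustration):

```php
<?php
/**
 * UID-based partitioning algorithm (flexible version).
 * @param int $uid   User ID
 * @param int $bit   Number of digits reserved in the table suffix
 * @param int $seed  Number of bits to shift right
 */
function getTable( $uid, $bit, $seed ) {
    // e.g. $bit = 4 makes the format string "%04d"
    return "user_" . sprintf( "%0{$bit}d", $uid >> $seed );
}

// Four-digit suffix, 2^20 UIDs per table (the 10-billion-row plan):
echo getTable( 12345678, 4, 20 ), "\n"; // user_0011
// Six-digit suffix for the 100-billion-row plan, same shift:
echo getTable( 12345678, 6, 20 ), "\n"; // user_000011
```

Widening $bit only pads the suffix with more zeros, so existing data can be migrated by a simple table rename; changing $seed, by contrast, reshuffles which UIDs share a table and would require moving rows.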