Introduction

With the widespread adoption of Internet applications, the storage and access of massive data has become a bottleneck in system design. For a large-scale Internet application, billions of page views per day place a very high load on the database, creating serious problems for system stability and scalability. Improving site performance by segmenting data and scaling the data layer horizontally has become the preferred approach for architects.
Why segment data?

The above briefly describes what data segmentation is. Readers may wonder: why is data segmentation necessary? Isn't a mature, stable database like Oracle sufficient to store and query massive data? Oracle is indeed very mature and stable, but its high license fees and the high-end hardware it requires are not affordable for every company. Imagine annual license fees and minicomputers each costing tens of millions of yuan: can an average company afford this? And even if it can, why not choose a cheaper solution with better horizontal scalability, if one exists?

We also know that no matter how well configured a machine is, it has physical limits. When an application reaches or far exceeds the limit of a single machine, we must either keep upgrading the hardware or enlist more machines; the common solution is horizontal expansion, adding machines to share the load. We also have to consider whether, as the business logic continues to grow, the system can keep up through linear growth alone. Sharding distributes computation, storage, and I/O across multiple machines in parallel, making full use of their combined processing capacity while avoiding single points of failure, improving system availability, and providing good fault isolation.

Considering these factors, data segmentation is necessary. By building a cluster from free MySQL and cheap servers, or even PCs, we can achieve the effect of a minicomputer running a large commercial database while greatly reducing capital investment and operating costs. Why not do it?
In large and medium-sized projects, when designing the database, the database or data tables are usually split horizontally to keep the load on any single database or table within limits. Here we introduce two table-segmentation methods commonly used in projects. Both rely on application-level logic to route each query to the correct table.

First, we need to decide which field to base the horizontal split on. In our system (an SNS), the user's UID runs through the entire system and is unique and auto-incrementing, so it is the natural field to partition by.

Method 1: MD5 hashing

Hash the UID with md5(), then take the first few hex digits (here, the first two), so that different UIDs are hashed into different user tables (user_xx):

```php
function getTable( $uid ) {
    $ext = substr( md5( $uid ), 0, 2 );
    return "user_" . $ext;
}
```

With this trick we can distribute UIDs across 256 user tables: user_00, user_01 ... user_ff. Because the UID is a monotonically increasing number, the md5 hash spreads user data almost evenly across the tables. But there is a problem: as the system gains more and more users, the data volume of each table inevitably grows, and this scheme cannot be expanded beyond its fixed 256 tables, which brings us back to the problem at the beginning of the article.

Method 2: Bit shifting

The specific method is:

```php
function getTable( $uid ) {
    return "user_" . sprintf( "%04d", $uid >> 20 );
}
```

Here we shift the UID 20 bits to the right, so the first 2^20 (roughly one million) users go into the first table, user_0000, the next million into the second table, user_0001, and so on. If the user base keeps growing, we simply add more user tables.
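To make the routing concrete, here is a minimal runnable sketch of both schemes. The function names (getTableByHash, getTableByShift) and the sample UIDs are illustrative choices, not part of the original article:

```php
<?php
// MD5-based routing: the first two hex digits of md5($uid)
// select one of 256 tables, user_00 ... user_ff.
function getTableByHash( $uid ) {
    return "user_" . substr( md5( $uid ), 0, 2 );
}

// Shift-based routing: every block of 2^20 (~1 million)
// consecutive UIDs shares one table, user_0000, user_0001, ...
function getTableByShift( $uid ) {
    return "user_" . sprintf( "%04d", $uid >> 20 );
}

echo getTableByHash( 1 ), "\n";         // user_c4
echo getTableByHash( 2 ), "\n";         // user_c8
echo getTableByShift( 1 ), "\n";        // user_0000
echo getTableByShift( 12345678 ), "\n"; // user_0011
```

Note how consecutive UIDs land in scattered tables under the hash scheme but fill tables sequentially under the shift scheme; this is exactly why the shift scheme can grow by adding tables while the hash scheme cannot.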
Since the table suffix we reserve is four digits, we can create up to 10,000 user tables: user_0000, user_0001 ... user_9999. With 10,000 tables and 1 million rows in each, we can store 10 billion user records. If your user data exceeds even that, it doesn't matter: just widen the reserved suffix. For example, for 100 billion rows at 1 million rows per table, you need 100,000 tables, so the suffix must be six digits.

The algorithm above can be written more flexibly:

```php
/**
 * UID-based partitioning algorithm
 * @param int $uid   User ID
 * @param int $bit   Number of digits reserved in the table suffix
 * @param int $seed  Number of bits to shift right
 */
function getTable( $uid, $bit, $seed ) {
    return "user_" . sprintf( "%0{$bit}d", $uid >> $seed );
}
```

Summary

Both methods require us to estimate, as accurately as possible, the maximum amount of user data the system will hold and the maximum load a single table can handle. In the second scheme, for example, if we estimate 10 billion users and an optimal single-table size of 1 million rows, we shift the UID right by 20 bits so that each table holds about 1 million rows, and reserve four digits in the table name (user_xxxx) to allow for 10,000 tables. In the first scheme, if each table holds 1 million rows and we take the first two hex digits of the MD5 hash, there are only 256 tables, so the system can hold 256 × 1 million rows in total; if your total data volume is larger, you must take the first three, four, or even more digits of the hash.

Both methods split the data horizontally across different tables. Compared with the first method, the second is more scalable.
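As a self-contained sketch, the flexible routing function can be exercised with the parameter choices discussed above (the sample UID is arbitrary and chosen only for illustration):

```php
<?php
/**
 * UID-based partitioning algorithm (flexible version).
 * @param int $uid   User ID
 * @param int $bit   Number of digits reserved in the table suffix
 * @param int $seed  Number of bits to shift right
 */
function getTable( $uid, $bit, $seed ) {
    // e.g. $bit = 4 makes the format string "%04d"
    return "user_" . sprintf( "%0{$bit}d", $uid >> $seed );
}

// Four-digit suffix, 2^20 UIDs per table (the 10-billion-row plan):
echo getTable( 12345678, 4, 20 ), "\n"; // user_0011
// Six-digit suffix for the 100-billion-row plan, same shift:
echo getTable( 12345678, 6, 20 ), "\n"; // user_000011
```

Widening $bit only pads the suffix with more zeros, so existing data can be migrated by a simple table rename; changing $seed, by contrast, reshuffles which UIDs share a table and would require moving rows.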