Why should the number of rows in a single MySQL table not exceed 5 million?

Today, let's discuss an interesting question: how much data can a single MySQL table hold before you need to consider splitting the database and tables? Some say 20 million rows, others say 5 million. What value do you think is appropriate?

There was once a widely circulated saying in China's Internet technology circles: once a single MySQL table exceeds 20 million rows, performance drops significantly. The rumor is said to have originated at Baidu. As the story goes, a DBA testing MySQL performance found that SQL operations slowed sharply once a single table reached 20 million rows, and the conclusion was drawn from that. Baidu engineers then reportedly carried the figure with them when they moved to other companies, and so it spread through the industry.

Later, Alibaba's "Java Development Manual" recommended splitting databases and tables only when a single table exceeds 5 million rows or 2 GB in size. With Alibaba's endorsement behind it, many people adopted this as the standard for deciding when to split tables when designing storage for large data sets.

So which value is right? Why 5 million rows rather than 3 million or 8 million? You might say it is simply the best value Alibaba found in production. But then how was that value arrived at? Take a moment to think about it.

In fact, the threshold has less to do with the row count itself than with MySQL's configuration and the machine's hardware. To improve performance, InnoDB caches table and index pages in memory, in the buffer pool. If the buffer pool is large enough to hold the pages a workload touches, queries are served from memory and there is no problem. Once a single table grows past a certain size, however, memory can no longer hold its indexes, so subsequent SQL queries incur disk I/O and performance degrades. This also depends on the specific table design (row width, number and size of indexes), but the underlying constraint is memory. In that situation, upgrading the hardware, especially adding memory, may bring an immediate performance improvement.
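To make the memory argument concrete, here is a rough back-of-the-envelope sketch in Python. The function names, the 400-byte average row width, and the 75% usable fraction are illustrative assumptions, not figures from MySQL or from the Alibaba manual; 128 MiB is the default value of innodb_buffer_pool_size. Real InnoDB storage adds page and index overhead, so treat the result as an order-of-magnitude estimate only.

```python
# Rough estimate only: ignores InnoDB page overhead, secondary indexes, and fragmentation.

MIB = 1024 ** 2
GIB = 1024 ** 3


def approx_table_bytes(rows: int, avg_row_bytes: int) -> int:
    """Approximate table size as rows * average row width (a simplifying assumption)."""
    return rows * avg_row_bytes


def fits_in_buffer_pool(rows: int, avg_row_bytes: int,
                        buffer_pool_bytes: int,
                        usable_fraction: float = 0.75) -> bool:
    """Return True if the estimated table size fits in the usable part of the buffer pool.

    usable_fraction is a hypothetical safety margin that leaves room for
    other tables and internal structures.
    """
    return approx_table_bytes(rows, avg_row_bytes) <= buffer_pool_bytes * usable_fraction


if __name__ == "__main__":
    rows, row_width = 5_000_000, 400  # hypothetical table
    for pool in (128 * MIB, 8 * GIB):
        print(pool // MIB, "MiB pool ->", fits_in_buffer_pool(rows, row_width, pool))
```

Under these assumed numbers, the same 5 million rows overflow a default-sized buffer pool but fit comfortably in an 8 GiB one, which is why a fixed row count cannot serve as a universal threshold.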

My view on database and table sharding is that it should be driven by actual needs and not over-designed. Do not shard at the start of a project. Consider sharding only as the business grows and other optimizations have been exhausted. On this point, Alibaba's "Java Development Manual" adds: if you estimate that the data volume will not reach this level within three years, do not split the database or table when creating the table. So, back to the original question: what value is appropriate? My suggestion is to make a comprehensive assessment based on your own machines. If you have no standard in mind, use 5 million rows as a provisional common standard; it is a reasonable compromise.
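The three-year rule of thumb can be checked with simple arithmetic. The sketch below assumes constant (linear) daily growth, which is a simplification, and the function names and sample figures are hypothetical.

```python
def projected_rows(current_rows: int, new_rows_per_day: int, years: int = 3) -> int:
    """Project the row count, assuming a constant number of new rows per day."""
    return current_rows + new_rows_per_day * 365 * years


def should_plan_sharding(current_rows: int, new_rows_per_day: int,
                         threshold_rows: int = 5_000_000, years: int = 3) -> bool:
    """Return True if the projected row count exceeds the chosen threshold."""
    return projected_rows(current_rows, new_rows_per_day, years) > threshold_rows


if __name__ == "__main__":
    # Hypothetical table: 200,000 rows today, 3,000 new rows per day.
    print(projected_rows(200_000, 3_000))
    print(should_plan_sharding(200_000, 3_000))
```

The 5 million default is only the provisional figure discussed above; pass a threshold_rows value derived from your own hardware assessment if you have one.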

Finally, here are some points on writing SQL that may be helpful.

SQL writing needs to be optimized

  • Use LIMIT to cap the number of rows a query returns.
  • Avoid SELECT *; list only the columns you need.
  • Prefer joins to subqueries.
  • Split large DELETE or INSERT statements into smaller batches.
  • Turn on the slow query log to find slow SQL.
  • No operations on columns: SELECT id FROM t WHERE age + 1 = 10. Applying any operation to an indexed column, including functions and arithmetic expressions, generally prevents the index from being used and leads to a table scan. Move the operation to the right-hand side of the comparison instead, for example WHERE age = 9.
  • Keep SQL statements as simple as possible: a single statement generally runs on one CPU core; split large statements into small ones to shorten lock times; one large statement can hold locks long enough to block the rest of the database.
  • Rewrite OR as IN: evaluating a chain of OR conditions costs O(n), while a lookup in a constant IN list costs about O(log n). Keep the number of values in an IN list under roughly 200.
  • Avoid stored functions and triggers; implement that logic in the application instead.
  • Avoid LIKE '%xxx' queries; a leading wildcard prevents the use of an ordinary index.
  • Use JOIN sparingly; keep the number of joined tables small.
  • Compare values of the same type, such as '123' with '123' or 123 with 123, to avoid implicit type conversion.
  • Try to avoid the != or <> operators in the WHERE clause; otherwise the engine may abandon the index and perform a full table scan.
  • For consecutive values, use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
  • Don't fetch the entire table for list data. Use LIMIT to paginate, and keep the number of rows per page from growing too large.
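
The "no operations on columns" tip is easy to observe with a query plan. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, since it needs no server; SQLite is not MySQL, and MySQL's EXPLAIN output looks different, but the principle that an expression on an indexed column defeats an index seek applies in both. The table t, the index idx_age, and the helper plan_details are hypothetical names chosen for illustration.

```python
import sqlite3


def plan_details(conn: sqlite3.Connection, sql: str) -> list:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, age INTEGER)")
conn.execute("CREATE INDEX idx_age ON t (age)")

# Arithmetic on the indexed column: the planner cannot seek on idx_age.
scan_plan = plan_details(conn, "SELECT id FROM t WHERE age + 1 = 10")

# Same predicate with the arithmetic moved to the constant side.
seek_plan = plan_details(conn, "SELECT id FROM t WHERE age = 9")

print(scan_plan)
print(seek_plan)
```

In SQLite the first plan is reported as a SCAN and the second as a SEARCH using the index. On MySQL, run EXPLAIN on your own queries to confirm how the optimizer treats them.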