Detailed explanation of MySQL database tens of millions of data query and storage


Million-level data processing solutions

Data storage structure design

Table field design

  • Declare columns NOT NULL wherever possible. NULL values are awkward to query and take up extra index space; a default such as 0 is recommended instead.
  • For status-type fields such as status or type, avoid negative values such as -1. This lets you declare the column UNSIGNED, which doubles its positive range.
  • Prefer TINYINT or SMALLINT over INT where the value range allows, and avoid BIGINT unless it is really needed, since smaller types take up less space.
  • String fields take up more space than numeric fields, so use integers instead of strings where possible; in many scenarios a coding scheme lets an integer stand in for a string.
  • Do not size string columns arbitrarily; keep them as short as the business requirements allow.
  • Store IP addresses as integers (MySQL's INET_ATON/INET_NTOA functions handle the conversion).
  • A single table should not have too many fields; fewer than 20 is recommended.
  • Reserve foreseeable fields in advance, because the larger the table grows, the more expensive it becomes to alter its structure.
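Taken together, the guidelines above might look like this in a table definition (a hypothetical sketch; the table and column names are illustrative):

```sql
-- Hypothetical table illustrating the field-design guidelines above.
CREATE TABLE user_login_log (
    id         BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED     NOT NULL DEFAULT 0,   -- NOT NULL with a numeric default
    status     TINYINT UNSIGNED NOT NULL DEFAULT 0,   -- small UNSIGNED integer, no negative codes
    login_ip   INT UNSIGNED     NOT NULL DEFAULT 0,   -- IPv4 stored as an integer
    device     VARCHAR(32)      NOT NULL DEFAULT '',  -- string length kept as small as the business allows
    created_at INT UNSIGNED     NOT NULL DEFAULT 0,   -- Unix timestamp as an integer
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Writing and reading the integer-encoded IP:
INSERT INTO user_login_log (user_id, status, login_ip, device, created_at)
VALUES (1, 1, INET_ATON('192.168.1.10'), 'ios', 1620000000);

SELECT INET_NTOA(login_ip) AS ip FROM user_login_log WHERE id = 1;
```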

Index design

  • Indexing is a space-for-time trade-off. Design indexes around the business's query patterns; well-chosen indexes are enough to cope with millions of rows. Get into the habit of checking query plans with EXPLAIN.
  • A common misconception is that more indexes are always better: every extra index slows down writes.
  • Keep indexed fields as short as possible; this saves a lot of index space.
  • Consider dropping foreign key constraints and enforcing the relationships in application code; this generally performs better.
  • Composite indexes follow the leftmost-prefix rule: the column order should line up with the query conditions. Remove single-column indexes that a composite index already covers.
  • Fields with few distinct values (low cardinality) are poor index candidates. For example, indexing a gender field with only two or three values is of little use.
  • Add indexes to fields that need sorting; because an index is already sorted, it can speed up ORDER BY.
  • For string fields, use a prefix index rather than indexing the full column; this can significantly reduce index space.
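As a sketch of these rules (table and column names are illustrative): a composite index whose column order matches the query conditions, a prefix index on a long string column, and an EXPLAIN check.

```sql
-- Composite index: column order follows the query conditions (leftmost-prefix rule).
ALTER TABLE orders ADD INDEX idx_user_status_created (user_id, status, created_at);

-- Prefix index on a long string column: index only the first 12 characters.
ALTER TABLE orders ADD INDEX idx_remark (remark(12));

-- Verify that the query actually uses the composite index:
EXPLAIN SELECT order_no, status
FROM orders
WHERE user_id = 42 AND status = 1
ORDER BY created_at;
```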

Query statement optimization

  • Prefer several simple queries over one complex nested query.
  • Do not use SELECT *. Query indexed fields where possible, so a covering index can avoid a lookup back to the table.
  • Use LIMIT to cap the number of rows returned.
  • Keep the queried fields on an index as much as possible, especially with composite indexes, and pay extra attention to leftmost-prefix matching.
  • Split large DELETE/INSERT operations into batches: run in one shot, they lock the table and block other business operations, and MySQL also limits the length of a single SQL statement.
  • Avoid MySQL functions and calculations where the application can do the work instead. As the points above suggest, whatever the program can handle should not be pushed onto the database, because most servers' performance bottleneck is the database.
  • For counting rows, the usual performance ordering is: count(1) = count(*) > count(primary key) > count(other field).
  • If BETWEEN can express the condition, prefer it over IN; if IN can express it, prefer it over OR.
  • Avoid operators such as != or <>, IS NULL / IS NOT NULL, and IN / NOT IN where possible, because these conditions often cannot use an index.
  • Keep SQL as simple as possible, with as few joins as you can; more than two joins is not recommended.
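For example, a large DELETE can be split into bounded batches (a sketch; the table name and cutoff are illustrative, and the repeat loop would normally live in application code or a stored procedure):

```sql
-- Instead of one huge statement that holds locks for a long time:
-- DELETE FROM logs WHERE created_at < 1609459200;

-- Delete in bounded batches; repeat this statement until it affects 0 rows:
DELETE FROM logs
WHERE created_at < 1609459200
LIMIT 10000;
```

Each batch commits quickly and releases its locks, so other business queries can interleave between batches.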

Ten-million-level data processing solutions

Data storage structure design

At this stage the data itself carries real value: besides regular business needs, there will also be data-analysis requirements. By now the schema rarely changes, and modifying the original structure is basically off the table. Optimization is usually considered from three angles: partitioning, table sharding, and database sharding.

Partition:

  • Partitioning is a horizontal division in which the database splits a table into multiple smaller, more manageable pieces according to certain rules. It is completely transparent to the application and does not affect business logic, i.e. no code changes are needed. More data can be stored, and queries and deletes can operate per partition, which achieves the optimization goal. If you are considering partitioning, prepare in advance for the following limitations:
  • A table can have at most 1024 partitions (raised to 8192 in MySQL 5.6 and later). In practice, avoid touching more than about 100 partitions in one operation, because opening partitions also takes time.
  • If the table has a primary key or unique index, every such key must include all the columns used in the partitioning expression; in other words, the partitioning key must be part of every primary key and unique index.
  • Foreign key constraints cannot be used in partitioned tables.
  • NULL values defeat partition pruning and end up in the default partition, so do not allow NULL in the partitioning field.
  • All partitions of a table must use the same storage engine.
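A minimal RANGE-partitioning sketch that respects the limitations above (the partitioning column is NOT NULL and part of the primary key, no foreign keys, one engine; names are illustrative):

```sql
CREATE TABLE orders_part (
    id           BIGINT UNSIGNED   NOT NULL AUTO_INCREMENT,
    user_id      INT UNSIGNED      NOT NULL DEFAULT 0,
    created_year SMALLINT UNSIGNED NOT NULL,  -- partition key: NOT NULL, included in the PK
    PRIMARY KEY (id, created_year)
) ENGINE=InnoDB
PARTITION BY RANGE (created_year) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- A condition on the partition key lets MySQL prune to one partition:
EXPLAIN SELECT COUNT(*) FROM orders_part WHERE created_year = 2021;
```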

Table sharding (sub-tables):

Table sharding comes in two forms: horizontal and vertical.

Horizontal sharding splits one table into several smaller tables with the same structure, such as table1, table2, and so on, to relieve read and write pressure.

Vertical sharding moves some fields out into a new table, so each table has a different structure; this helps when table locking under high concurrency is the problem.

As you can imagine, sharding a table means changing program logic, so it is usually done early in a project, when a large data volume is foreseeable. Sharding at a late stage is not recommended, because the cost is very high.
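A horizontal-sharding sketch (illustrative names; the shard count and routing rule are assumptions): identical shard tables, with the application routing each row by a hash of the sharding key.

```sql
-- Shard tables share one structure; CREATE TABLE ... LIKE copies it exactly.
CREATE TABLE orders_0 LIKE orders;
CREATE TABLE orders_1 LIKE orders;

-- The application routes by user_id % 2 (2 shards assumed here):
-- user_id = 7  ->  7 % 2 = 1  ->  orders_1
INSERT INTO orders_1 (user_id, amount) VALUES (7, 100);
SELECT * FROM orders_1 WHERE user_id = 7;
```

Queries that span all users must be fanned out to every shard and merged by the application, which is part of the cost mentioned above.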

Database sharding (sub-databases):

Database sharding usually takes a master-slave form: one master database server is replicated to one or more slave databases. The master handles writes and the slaves handle reads, achieving read/write splitting, high availability, and data backup.

Of course, the master-slave mode has its own defects, such as replication lag and problems caused by oversized binlog files; I will not go into the details here.

Other:

Separate hot and cold tables. Historical data that few people query can be moved to a separate cold database and kept for query only, reducing the volume of data in the hot table.

Database table primary key design

For primary key design, I personally recommend a self-increasing numeric ID with a time component, as produced by distributed auto-increment ID algorithms such as:

  • The Snowflake algorithm
  • Baidu's distributed ID generator (UidGenerator)
  • Meituan's distributed ID generator (Leaf)

Why use these algorithms? This is related to the MySQL data storage structure.

From a business perspective:

When designing a table, a business field may look like a natural primary key, but such fields are only unique in theory. A book number used as the primary key, for example, is unique on paper, yet duplicates can occur in practice. It is therefore better to use a business-independent auto-increment ID as the primary key and put a unique constraint on the book number instead.

Technically speaking:

1. If the table uses an auto-increment primary key, each new record is appended in order after the current last index entry; when a page fills up, a new page is opened automatically. Overall this benefits both query and insert performance.

2. In InnoDB, the primary key index stores both the key value and the row data in its leaf nodes, which means the data file itself is organized as a B+ tree.

3. If no primary key is defined, InnoDB uses the first non-nullable UNIQUE key as the primary key; if there is none, it generates a hidden 6-byte row ID. In the clustered index, N rows make up a page (typically 16 KB). If out-of-order values are inserted, keeping the B+ tree balanced causes frequent page splits and record moves, slowing inserts. The clustered index's key should therefore be a continuously increasing value rather than a random one (do not use random strings or UUIDs).

4. Therefore, an InnoDB primary key should be an integer, and an increasing one; this is efficient for both storage and queries.
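Points 1-4 boil down to the following pattern (a sketch; names are illustrative): a business-independent, monotonically increasing integer primary key, with the business identifier protected by a unique constraint instead.

```sql
CREATE TABLE book (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate key: rows append at the right edge of the B+ tree
    book_no VARCHAR(32)     NOT NULL DEFAULT '',      -- the "theoretically unique" business number
    title   VARCHAR(128)    NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    UNIQUE KEY uk_book_no (book_no)                   -- uniqueness enforced without making it the clustered key
) ENGINE=InnoDB;
```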

MySQL Interview Questions

MySQL database tens of millions of data query optimization solution

The deeper the LIMIT offset in a paged query, the slower the query. This leads to two conclusions:

1. The query time of the limit statement is proportional to the position of the starting record.

2. The MySQL limit statement is very convenient, but it is not suitable for direct use for tables with many records.

The test table uses InnoDB as the storage engine, with id as the auto-increment primary key and the default primary key index:

SELECT id FROM test LIMIT 9000000,100;

There are currently two optimization solutions: using id as the query condition using a subquery and using join.

1. Using id >= with a subquery:

SELECT * FROM test WHERE id >= (SELECT id FROM test LIMIT 9000000, 1) LIMIT 0, 100;

2. Using a JOIN:

SELECT * FROM test a JOIN (SELECT id FROM test LIMIT 9000000, 100) b ON a.id = b.id;

These two optimized queries take roughly the same time; they rely on the same principle, so the effect is similar. Personally, though, I recommend the JOIN form and minimizing subqueries. Note: the queries above start at the ten-million mark; if the offset drops to the million level, they are faster still:

SELECT * FROM test a JOIN (SELECT id FROM test LIMIT 1000000,100) b ON a.id = b.id

Which MySQL storage engines have you used?

What are their characteristics and differences?

This question comes up often in senior-developer interviews, and in daily development too. MySQL has many storage engines, but the two used most are InnoDB and MyISAM, so when an interviewer asks which storage engines MySQL has, it is enough to cover those two.

So what are their characteristics and differences?

MyISAM: MySQL's original default table type, based on the traditional ISAM format (Indexed Sequential Access Method, a standard method for storing records and files). It is not transaction-safe and does not support foreign keys. For workloads dominated by SELECTs with some INSERTs, MyISAM is a reasonable fit.

InnoDB: a transaction-safe engine. Its biggest features are support for foreign keys, row locks, and transactions. For workloads with many updates and inserts, especially highly concurrent, high-QPS ones, InnoDB is recommended. Note: before MySQL 5.5 the default storage engine was MyISAM; from MySQL 5.5 onward the default is InnoDB.

Differences between MyISAM and InnoDB

1. InnoDB supports transactions; MyISAM does not. In InnoDB, each SQL statement is wrapped in its own transaction and auto-committed by default, which costs performance, so it is best to group related statements between BEGIN and COMMIT as one transaction.

2. InnoDB supports foreign keys, while MyISAM does not.

3. InnoDB is a clustered index and uses B+Tree as the index structure. The data file is bound to the (primary key) index (the table data file itself is an index structure organized by B+Tree). There must be a primary key, and the efficiency of the primary key index is very high. MyISAM is a non-clustered index and also uses B+Tree as the index structure. The index and data files are separate, and the index saves the pointer to the data file. The primary key index and secondary indexes are independent.

4. InnoDB does not save the specific number of rows in the table, and the entire table needs to be scanned when executing select count(*) from table. MyISAM uses a variable to save the number of rows in the entire table. When executing the above statement, you only need to read the variable, which is very fast.

5. MyISAM supports full-text indexes and has higher query efficiency; InnoDB originally did not, but gained full-text index support in MySQL 5.6.

6. InnoDB supports both table-level and row-level locks (row-level by default), while MyISAM supports only table-level locks.

7. InnoDB tables must have a primary key (if the user does not specify one, InnoDB finds or generates one itself), while MyISAM tables can do without one.

8. InnoDB's storage files are .frm and .ibd, while MyISAM's are .frm, .MYD, and .MYI.

9. InnoDB: .frm is the table-definition file and .ibd is the data file.

10. MyISAM: .frm is the table-definition file, .MYD is the data file, and .MYI is the index file.
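The engine is chosen per table and can be inspected or changed later (a sketch; the table name is illustrative, and converting a large table can be slow because it rebuilds the table):

```sql
-- Choose the engine at creation time:
CREATE TABLE report_archive (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    body TEXT NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

-- Inspect which engine a table uses:
SHOW TABLE STATUS LIKE 'report_archive';

-- Convert an existing table to InnoDB:
ALTER TABLE report_archive ENGINE=InnoDB;
```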

Optimization of MySQL complex query statements

Complex SQL mostly arises from multi-table joins. So how should this kind of SQL be optimized? There are well-worn routines, and we only need to follow them. The complex-SQL optimization checklist:

1. Check the SQL with EXPLAIN. EXPLAIN helps you analyze the performance bottlenecks of a query or table structure: its output shows how your indexes and primary key are used, how the table is scanned and sorted, whether there is a full table scan, and so on.

2. Try to use index fields for query conditions. If a table has multiple conditions, try to use composite index queries. When using composite indexes, pay attention to the order of the fields.

3. Use join as much as possible to associate multiple tables and reduce the use of subqueries. If the associated fields of the table can use the primary key, use the primary key, that is, use the index field as much as possible. If the associated field is not an index field, you can consider adding an index based on the situation.

4. Try to use limit for paging batch query, and do not retrieve all at once.

5. Absolutely avoid SELECT *; select only the specific fields you need and cut queries for unnecessary fields.

6. Try to convert or to union all.

7. Try to avoid using is null or is not null.

8. Be careful with LIKE: a leading wildcard (LIKE '%abc') or a double-ended wildcard (LIKE '%abc%') cannot use the index; only a trailing wildcard (LIKE 'abc%') can.

9. Avoid applying functions to columns in the WHERE clause, because a function on a column defeats the index.

10. Avoid not-equal (!= or <>) conditions where possible, because they usually cannot use the index.

11. Use EXISTS instead of IN, and NOT EXISTS instead of NOT IN; this is often more efficient.

12. Avoid using the HAVING clause. HAVING will only filter the result set after all records are retrieved, which requires sorting, totaling, and other operations. If you can limit the number of records through the WHERE clause, you can reduce this overhead.

13. Never use ORDER BY RAND().
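Two of the rules above, shown as before/after sketches (table and column names are illustrative):

```sql
-- Point 9: a function on the column defeats the index.
-- slow:  SELECT id FROM orders WHERE DATE(created_at) = '2021-06-01';
-- fast:  rewrite as a range predicate so the index on created_at can be used:
SELECT id FROM orders
WHERE created_at >= '2021-06-01 00:00:00'
  AND created_at <  '2021-06-02 00:00:00';

-- Point 6: OR across different columns often prevents index use.
-- slow:  SELECT id FROM orders WHERE user_id = 42 OR status = 3;
-- fast:  each branch below can use its own index
--        (use UNION instead of UNION ALL if the branches can overlap):
SELECT id FROM orders WHERE user_id = 42
UNION ALL
SELECT id FROM orders WHERE status = 3;
```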

The above is a detailed explanation of querying and storing tens of millions of rows of data in MySQL.
