MySQL query optimization: a table optimization solution for 1 million rows

1. Query speed of the two storage engines (MyISAM vs. InnoDB)

InnoDB does not store a table's exact row count, so executing select count(*) from table forces InnoDB to scan the entire table to count the rows.

MyISAM simply reads the number of saved rows.

Note that when the count(*) statement contains a where condition, the two engines behave a little differently. For InnoDB tables, use count(*) or count(primary key) together with a where condition on an indexed column (ideally a column other than the primary key that carries a unique index). The condition can then be answered from the index, the query becomes very fast, and a full table scan is avoided.

Summary:

In MySQL, with 3 million records in a MyISAM table, using count(*) with a condition (and a correctly set index) runs in normal time. For data that is read frequently, the MyISAM engine is recommended.

2. MySQL paging problem with millions of data

We often use paging during development. The core technique is reading rows with LIMIT. Testing LIMIT-based paging produced the following timings:

select * from news order by id desc limit 0,10        -- 0.003 seconds
select * from news order by id desc limit 10000,10    -- 0.058 seconds
select * from news order by id desc limit 100000,10   -- 0.575 seconds
select * from news order by id desc limit 1000000,10  -- 7.28 seconds

We were surprised to find that with a large data set, the larger the paging offset, the slower the query. Paging a million records deep already takes over 7 seconds, a number we cannot accept!

Improvement plan 1

select * from news 
where id <= (select id from news order by id desc limit 1000000, 1)
order by id desc 
limit 0,10

The query time is 0.365 seconds, a dramatic improvement! How does it work?

We filter on id first. The subquery (select id from news order by id desc limit 1000000, 1) selects only the id column, so it can be answered from the index alone instead of reading full rows, which saves a great deal of overhead compared with select * or selecting multiple fields; the outer query then fetches complete rows for just the 10 articles on the page.
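The two forms can be compared side by side. Below is a minimal sketch using SQLite (table and column names assumed from the article; SQLite will not reproduce MySQL's timing gap, but it confirms the deferred-lookup rewrite returns the same page). Note that with a descending subquery the outer comparison must be `id <=` for the pages to match:

```python
import sqlite3

# Toy "news" table (schema assumed from the article: id is the primary key).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO news (id, title) VALUES (?, ?)",
                 [(i, f"title {i}") for i in range(1, 10001)])

# Naive paging: the engine must step over 5000 rows before returning 10.
naive = conn.execute(
    "SELECT * FROM news ORDER BY id DESC LIMIT 5000, 10").fetchall()

# Deferred lookup: the subquery needs only the id column, and the outer
# query fetches full rows for just the 10-row page. With a descending
# subquery, id <= boundary selects the same page as the plain OFFSET form.
deferred = conn.execute(
    "SELECT * FROM news "
    "WHERE id <= (SELECT id FROM news ORDER BY id DESC LIMIT 5000, 1) "
    "ORDER BY id DESC LIMIT 10").fetchall()

assert naive == deferred
```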

Improvement plan 2

Suitable for systems with continuous IDs, extremely fast!

select * from news 
where id between 1000001 and 1000010 
order by id desc

This is extremely fast, but note that BETWEEN is inclusive on both ends (the range above covers exactly 10 ids), and the approach fails for queries with extra conditions or discontinuous ids.
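Under the continuous-ID assumption, page boundaries become plain arithmetic. A small SQLite sketch (the `page_by_range` helper is hypothetical, not from the article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO news (id, title) VALUES (?, ?)",
                 [(i, f"title {i}") for i in range(1, 1001)])

def page_by_range(conn, page, size=10):
    # Valid only when ids run 1..MAX(id) with no gaps, as the text warns.
    hi = conn.execute("SELECT MAX(id) FROM news").fetchone()[0] - page * size
    lo = hi - size + 1
    return conn.execute(
        "SELECT * FROM news WHERE id BETWEEN ? AND ? ORDER BY id DESC",
        (lo, hi)).fetchall()

# Page 5 by range matches page 5 by OFFSET, with no row-skipping cost.
assert page_by_range(conn, 5) == conn.execute(
    "SELECT * FROM news ORDER BY id DESC LIMIT 50, 10").fetchall()
```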

3. Things to note when querying MySQL conditions and paging queries with millions of data

Continuing from the previous section, we add the query conditions:

select id from news 
where cate = 1
order by id desc 
limit 500000, 10

Query time: 20 seconds

A terrifyingly slow query! Let's optimize it with the technique from section 2:

select * from news
where cate = 1 and id <= (select id from news where cate = 1 order by id desc limit 500000, 1)
order by id desc 
limit 0, 10

Query time: 15 seconds

The improvement is modest, and the WHERE condition still dominates the cost. No amount of rewriting this SQL alone will solve the efficiency problem, so we change approach: create a narrow index table that records only the article ID and category information, keeping the large article-content field in a separate table.

Table news2 [article index table, engine: MyISAM, charset: utf8]

id   int(11)  primary key, auto-increment
cate int(11)  indexed

When writing data, synchronize the two tables. When querying, you can use news2 to perform conditional queries:

select * from news
where cate = 1 and id <= (select id from news2 where cate = 1 order by id desc limit 500000, 1)
order by id desc 
limit 0, 10

Note that the subquery in the id condition reads from the news2 table!

The running time drops to 1.23 seconds, nearly 20 times faster! And when the matching data set is around 100,000 rows, query time can be kept near 0.5 seconds, gradually approaching a value we can tolerate.

But over a second is still unacceptable for a server under load! Is there any further optimization? We tried a surprising variation:

Changing the storage engine of news2 to InnoDB, the result is amazing!

select * from news
where cate = 1 and id <= (select id from news2 where cate = 1 order by id desc limit 500000, 1)
order by id desc 
limit 0, 10

It only takes 0.2 seconds, which is really fast.
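The split-table idea can be sketched in SQLite (schemas assumed from the article; a `content` column stands in for the large article body). Writes go to both tables, and deep pages are located through the narrow news2, again pairing `<=` with the descending subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (id INTEGER PRIMARY KEY, cate INTEGER, content TEXT)")
conn.execute("CREATE TABLE news2 (id INTEGER PRIMARY KEY, cate INTEGER)")
conn.execute("CREATE INDEX idx_news2_cate ON news2 (cate)")

# Writes keep both tables in sync, as the article prescribes.
rows = [(i, i % 3, "article body " * 10) for i in range(1, 3001)]
conn.executemany("INSERT INTO news VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO news2 VALUES (?, ?)",
                 [(i, c) for i, c, _ in rows])

# Deep page in category 1: locate the boundary id via the narrow news2,
# then fetch the full rows from news.
page = conn.execute(
    "SELECT * FROM news WHERE cate = 1 AND id <= "
    "(SELECT id FROM news2 WHERE cate = 1 ORDER BY id DESC LIMIT 500, 1) "
    "ORDER BY id DESC LIMIT 10").fetchall()

direct = conn.execute(
    "SELECT * FROM news WHERE cate = 1 "
    "ORDER BY id DESC LIMIT 500, 10").fetchall()
assert page == direct
```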

4. The difference between the MySQL storage engines MyISAM and InnoDB

MySQL has multiple storage engines, MyISAM and InnoDB are two commonly used ones. Here are some basic concepts about these two engines (not an in-depth introduction).

The MyISAM storage engine, based on the traditional ISAM type, supports full-text search, but is not transaction-safe and does not support foreign keys. Each MyISAM table is stored in three files: the frm file stores the table definition; the data file is MYD (MYData); and the index file is MYI (MYIndex).

InnoDB is a transactional engine supporting rollback, crash recovery, multi-version concurrency control, ACID transactions, and row-level locking. (InnoDB row locks are not absolute: if MySQL cannot determine the range to scan for a statement, such as a LIKE query, InnoDB falls back to locking the entire table.) It also provides Oracle-style consistent non-locking reads. InnoDB stores its tables and indexes in a tablespace, which can consist of several files.

Core Differences

MyISAM is non-transaction-safe, while InnoDB is transaction-safe.

The granularity of MyISAM locks is table level, while InnoDB supports row level locking.

MyISAM supports full-text indexing; InnoDB did not gain full-text index support until MySQL 5.6.

MyISAM is relatively simple, so it is more efficient than InnoDB. Small applications can consider using MyISAM.

MyISAM tables are saved in the form of files. Using MyISAM storage in cross-platform data transfer will save a lot of trouble.

InnoDB tables are safer than MyISAM tables. You can convert a non-transactional table into a transactional one (alter table tablename engine=innodb) without losing data.

Application Scenario

MyISAM manages non-transactional tables. It provides high-speed storage and retrieval, as well as full-text search capabilities. If your application needs to execute a large number of SELECT queries, MyISAM is a better choice.

InnoDB is designed for transaction processing applications and has many features, including ACID transaction support. If a large number of INSERT or UPDATE operations need to be performed in your application, InnoDB should be used to improve the performance of multi-user concurrent operations.

MySQL storage engines and indexes

A database needs indexes: without one, every retrieval is a sequential scan, and O(n) lookup is almost unbearable at scale. Indexing a table on a single key with a B+ tree is easy to picture: store the key in the tree's nodes. But a record contains many columns, and if the B+ tree only indexes the primary key, searching on a non-primary-key column degenerates back into a sequential scan. The answer is a second index, organized as a separate B+ tree, on the column being searched.

There are two common ways to let multiple B+ trees reference the same table data: the clustered index and the non-clustered index (secondary index). Despite the names, these are not separate index types but two ways of storing the data. With clustered storage, the row data is kept together with the primary-key B+ tree in its leaf nodes, while each secondary-key B+ tree stores only the secondary key plus the primary key; the primary-key tree and the secondary-key trees are thus structurally different. With non-clustered storage, the leaf nodes of every B+ tree, primary key included, store pointers to the actual data rows rather than the rows themselves.

InnoDB uses a clustered index: the primary key is organized into a B+ tree and the row data is stored in its leaf nodes. A primary-key lookup such as "where id = 14" walks the B+ tree to the corresponding leaf node and finds the whole row there. A conditional search on the Name column takes two steps: first retrieve Name in the secondary-index B+ tree, reaching its leaf node to obtain the corresponding primary key; then use that primary key to perform a second B+ tree search in the primary index, whose leaf node yields the entire row.

MyISAM uses non-clustered indexes. Its two B+ trees look structurally identical; only the stored content differs: the primary-key index tree stores primary keys and the secondary-key index tree stores secondary keys. The table data is stored in an independent place, and the leaf nodes of both trees hold an address pointing to the real table data. As far as the table data is concerned, the two keys are equals, and since the index trees are independent, retrieval by secondary key never needs to touch the primary key's index tree.

To make the difference concrete, imagine a table storing 4 rows of data, where Id is the primary key index and Name carries a secondary index; drawing the two storage layouts side by side makes the difference between clustered and non-clustered indexes obvious.

We focus on clustered indexes. It seems that the efficiency of clustered indexes is obviously lower than that of non-clustered indexes, because each retrieval using auxiliary indexes requires two B+ tree searches. Isn't this redundant? What are the advantages of clustered indexes?

1 Since row data and leaf nodes are stored together, the primary key and row data are loaded into memory together. Once the leaf node is found, the row data can be returned immediately. If the data is organized according to the primary key ID, data can be obtained faster.

2 Using the primary key value, rather than a physical address, as the secondary index's "pointer" reduces maintenance work when rows move or data pages split. A primary-key pointer makes the secondary index take more space, but the benefit is that InnoDB never has to update secondary-index "pointers" when rows move. A row's physical position (located via 16K pages in the implementation) changes as the data is modified (B+ tree node splits, page splits), and with a clustered index the secondary index trees are unaffected no matter how the nodes of the primary-key B+ tree change.

Therefore, when it comes to millions of data or more, MySQL InnoDB's index performance is even better!

5. Some experience in MySQL performance optimization

a. Optimize your queries for the query cache

Most MySQL servers have the query cache enabled. It is one of the most effective ways to improve performance, and it is handled by the database engine itself: when the same query is executed repeatedly, its result is served from the cache without touching the table. (Note: the query cache was deprecated in MySQL 5.7 and removed in 8.0.)

The catch is that this is easy for programmers to defeat, because some ways of writing a query prevent MySQL from using the cache.

Take a look at the following example:

// The query cache does NOT work here: CURDATE() is non-deterministic
$r = mysql_query("SELECT username FROM user WHERE signup_date >= CURDATE()");

// The query cache works: the date is passed as a constant string
$today = date("Y-m-d");
$r = mysql_query("SELECT username FROM user WHERE signup_date >= '$today'");

The only difference between the two statements is CURDATE(). MySQL's query cache skips any statement containing a non-deterministic function such as CURDATE(), NOW(), or RAND(), because their return values vary between calls. Computing the value in application code and passing it in as a constant is all it takes to make the query cacheable.

b. Learn to use EXPLAIN

Using the EXPLAIN keyword allows you to see how MySQL processes your SQL statements.

explain select id, title, cate from news where cate = 1

If EXPLAIN shows a full table scan here (type: ALL), adding an index on the cate field will speed up the query.
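The effect can be illustrated with SQLite's analogue of MySQL's EXPLAIN, `EXPLAIN QUERY PLAN` (a sketch, not MySQL itself): before the index the planner reports a full scan, afterwards an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT, cate INTEGER)")

def plan(sql):
    # The human-readable detail string is the last column of each plan row.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id, title, cate FROM news WHERE cate = 1"
before = plan(query)   # e.g. "SCAN news" (a full table scan)
conn.execute("CREATE INDEX idx_cate ON news (cate)")
after = plan(query)    # e.g. "SEARCH news USING INDEX idx_cate (cate=?)"

assert "SCAN" in before
assert "idx_cate" in after
```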

c. Use LIMIT 1 when only one row of data is needed

When you query a table and need only one row, add LIMIT 1 so MySQL can stop scanning as soon as it finds a match.

d. Use indexes correctly

Indexes are not only for primary keys or unique fields. If a field in your table is frequently used for searching, sorting, or in WHERE conditions, create an index on it.

e. Do not ORDER BY RAND()

A very inefficient way to fetch random rows: MySQL has to generate a random value for every row and then sort the entire table.

f. Avoid SELECT *

The more data you read from the database, the slower the query will become. Also, if your database server and web server are two independent servers, this will increase the load on network transmission. You must develop a good habit of taking whatever you need.

g. Use ENUM instead of VARCHAR

The ENUM type is very fast and compact. In reality, it stores a TINYINT, but appears as a string. This makes it perfect for making a list of options.

If you have a field such as "Gender", "Country", "Ethnicity", "State" or "Department" that you know will have a limited number of values, then you should use ENUM instead of VARCHAR.

h. Using NOT NULL

Unless you have a very specific reason to use NULL values, you should always keep your columns NOT NULL. This may seem a bit controversial, please read on.

First, ask yourself how different is "Empty" from "NULL" (in the case of INT, that would be 0 and NULL)? If you think there is no difference between them, then you should not use NULL. (Did you know that in Oracle, NULL and Empty are the same string?)

Don't assume that NULL requires no space; it does require extra space, and your program will be more complicated when you do comparisons. Of course, this does not mean that you cannot use NULL. The reality is very complicated, and there are still some cases where you need to use NULL values.

The following is excerpted from MySQL's own documentation

“NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte.”

i. IP address is stored as UNSIGNED INT

Many programmers create a VARCHAR(15) field to store the IP as a string rather than as an integer. An integer takes only 4 bytes, giving you a fixed-length field, and it brings advantages in querying, especially for range conditions such as WHERE ip BETWEEN ip1 AND ip2.

It must be UNSIGNED INT, because IPv4 addresses use the full 32 unsigned bits.
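The conversion is a simple 4-byte pack/unpack; MySQL provides INET_ATON() and INET_NTOA() for the same purpose on the server side. A Python sketch:

```python
import socket
import struct

def ip_to_uint(ip: str) -> int:
    # Pack a dotted-quad IP into a 32-bit unsigned int (like MySQL's INET_ATON).
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def uint_to_ip(n: int) -> str:
    # The inverse conversion (like MySQL's INET_NTOA).
    return socket.inet_ntoa(struct.pack("!I", n))

assert ip_to_uint("192.168.0.1") == 3232235521
assert uint_to_ip(3232235521) == "192.168.0.1"
# Range checks become simple integer comparisons:
assert ip_to_uint("10.0.0.1") <= ip_to_uint("10.0.0.5") <= ip_to_uint("10.0.0.9")
```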

j. Fixed-length tables are faster

If all fields in a table are "fixed-length", the entire table is considered to be "static" or "fixed-length". For example, the table does not have fields of the following types: VARCHAR, TEXT, BLOB. As long as you include one of these fields, the table is no longer a "fixed-length static table", and the MySQL engine will handle it in another way.

Fixed-length tables will improve performance because MySQL will search faster, and since these fixed lengths make it easy to calculate the offset of the next data, reading will naturally be fast. If the field is not of fixed length, then each time you want to find the next entry, the program needs to find the primary key.

Also, fixed-length tables are easier to cache and rebuild. However, the only side effect is that fixed-length fields will waste some space, because fixed-length fields will allocate that much space regardless of whether you use it or not.
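The offset arithmetic that makes fixed-length rows fast can be sketched directly (a toy binary record format for illustration, not MySQL's actual on-disk layout):

```python
import struct

# Every record occupies RECORD.size bytes, so row i starts at byte
# i * RECORD.size: no scanning is needed to seek to it.
RECORD = struct.Struct("<I20s")     # 4-byte id + fixed 20-byte name
rows = [(i, f"user{i}".encode().ljust(20, b"\0")) for i in range(100)]
blob = b"".join(RECORD.pack(*r) for r in rows)

def read_row(i):
    off = i * RECORD.size           # O(1) offset arithmetic
    rid, name = RECORD.unpack_from(blob, off)
    return rid, name.rstrip(b"\0").decode()

assert read_row(0) == (0, "user0")
assert read_row(42) == (42, "user42")
```

With a variable-length field the offset of row i would depend on the lengths of all rows before it, which is exactly why VARCHAR/TEXT/BLOB columns break the "static table" fast path.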

k. Vertical split

"Vertical splitting" is a method of dividing a table in a database into several tables by columns, which can reduce the complexity of the table and the number of fields, thereby achieving the purpose of optimization. It should be noted that you should not frequently join the tables formed by these separated fields. Otherwise, the performance will be worse than when the fields are not separated, and the performance will drop exponentially.

l. Split large DELETE or INSERT statements

If you perform a large DELETE or INSERT query on a live website, you need to be very careful to avoid your operation bringing your entire website down. Because these two operations will lock the table. Once the table is locked, no other operations can be performed.

Apache runs many child processes or threads, which is how it achieves its throughput, but we do not want them piling up: accumulated processes, threads, and database connections consume large amounts of server resources, especially memory.

If you lock your table for a period of time, such as 30 seconds, then for a site with a high traffic volume, the accumulated access processes/threads, database links, and number of open files in these 30 seconds may not only cause your web service to crash, but may also cause your entire server to hang up immediately.
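A chunked delete loop, sketched here with SQLite (MySQL supports DELETE ... LIMIT n directly, while SQLite needs the subquery form shown; the batch size of 1000 is an arbitrary choice):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany("INSERT INTO log (msg) VALUES (?)", [("old entry",)] * 10000)

deleted = 0
while True:
    # Each batch is a short statement, so locks are held only briefly.
    cur = conn.execute(
        "DELETE FROM log WHERE id IN (SELECT id FROM log LIMIT 1000)")
    conn.commit()
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount
    # In production, sleep briefly here to let other queries through.

assert deleted == 10000
```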

m. Smaller columns will be faster

For most database engines, disk operations are probably the most significant bottleneck. So, making your data compact can be very helpful in this situation, as it reduces the number of accesses to the hard drive.

n. Choose the right storage engine

MySQL's two most commonly used storage engines are MyISAM and InnoDB, each with its pros and cons.

MyISAM suits query-heavy applications but handles heavy writes poorly: even updating a single field locks the whole table, and no other process, not even a reader, can touch it until the update completes. On the other hand, MyISAM is extremely fast at operations like SELECT COUNT(*).

InnoDB is a more complex storage engine and can be slower than MyISAM for some small applications. But it supports row locks, so it performs better when there are many write operations, and it offers more advanced features such as transactions.

This concludes this article on MySQL query optimization and a table optimization solution for 1 million records.

