Today I saw a case study on MySQL IN subquery optimization. I was a little skeptical at first (this would be impossible in SQL Server, but I’ll do a simple test later.) Then I made a table to test and verify it according to what he said, and found that the IN subquery of MySQL was not done well, which would indeed lead to the situation where the index could not be used (the IN subquery could not be used, so the scenario was MySQL, and the ending version was 5.7.18) MySQL test environment The test table is as follows create table test_table2 ( id int auto_increment primary key, pay_id int, pay_time datetime, other_col varchar(100) ) Create a stored procedure to insert test data. The characteristic of the test data is that pay_id is repeatable. Here, in the stored procedure, a repeated pay_id is inserted every 100 data items during the loop insertion of 3 million data items. The time field is random within a certain range. CREATE DEFINER=`root`@`%` PROCEDURE `test_insert`(IN `loopcount` INT) LANGUAGE SQL NOT DETERMINISTIC CONTAINS SQL SQL SECURITY DEFINER COMMENT '' BEGIN declare cnt int; set cnt = 0; while cnt< loopcount do insert into test_table2 (pay_id,pay_time,other_col) values (cnt,date_add(now(), interval floor(300*rand()) day),uuid()); if (cnt mod 100 = 0) then insert into test_table2 (pay_id,pay_time,other_col) values (cnt,date_add(now(), interval floor(300*rand()) day),uuid()); end if; set cnt = cnt + 1; end while; END Execute call test_insert(3000000); insert 303000 rows of data Two ways to write subqueries The query roughly means to query the data with business ID greater than 1 within a certain time period, so there are two ways of writing it. The first way of writing is as follows: the IN subquery contains the business ID whose business statistics are greater than 1 within a certain period of time. The outer layer queries according to the result of the IN subquery. The pay_id column of the business ID has an index, and the logic is relatively simple. This way of writing is indeed inefficient when the amount of data is large, and no index is needed. select * from test_table2 force index(idx_pay_id) where pay_id in ( select pay_id from test_table2 where pay_time>="2016-06-01 00:00:00" AND pay_time<="2017-07-03 12:59:59" group by pay_id having count(pay_id) > 1 ); Execution result: 2.23 seconds The second way of writing is to join with the subquery. This way of writing is equivalent to the above IN subquery writing. The following test found that the efficiency is indeed improved a lot. select tpp1.* from test_table2 tpp1, ( select pay_id from test_table2 WHERE pay_time>="2016-07-01 00:00:00" AND pay_time<="2017-07-03 12:59:59" group by pay_id having count(pay_id) > 1 ) tpp2 where tpp1.pay_id=tpp2.pay_id Execution result: 0.48 seconds In the execution plan of the subquery, it is found that the outer query is a full table scan, and the index on pay_id is not used. In the execution plan of the join self-check, the outer layer (query of the tpp1 alias) uses the index on pay_id. Later, I wanted to use a forced index for the first query method. Although there was no error, I found that it was useless. If the subquery is a direct value, the index can be used normally. It can be seen that MySQL's support for IN subqueries is really not very good. In addition: adding a case of using a temporary table, although it is more efficient than many join queries, it is also more efficient than directly using the IN subquery. In this case, the index can also be used, but in this simple case, there is no need to use a temporary table. The following is a similar case test in SQL Server 2014. The test table structure and number are exactly the same. It can be seen that in this case, the two writing methods can be considered to be exactly the same in SQL Server (execution plan + efficiency). In this respect, SQL Server is much better than MySQL. Below is the test environment script in sqlserver. create table test_table2 ( id int identity(1,1) primary key, pay_id int, pay_time datetime, other_col varchar(100) ) begin tran declare @i int = 0 while @i<300000 begin insert into test_table2 values (@i,getdate()-rand()*300,newid()); if(@i%1000=0) begin insert into test_table2 values (@i,getdate()-rand()*300,newid()); end set @i = @i + 1 end COMMIT GO create index idx_pay_id on test_table2(pay_id); create index idx_time on test_table2(pay_time); GO select * from test_table2 where pay_id in ( select pay_id from test_table2 where pay_time>='2017-01-21 00:00:00' AND pay_time<='2017-07-03 12:59:59' group by pay_id having count(pay_id) > 1 ); select tpp1.* from test_table2 tpp1, ( select pay_id from test_table2 WHERE pay_time>='2017-01-21 00:00:00' AND pay_time<='2017-07-30 12:59:59' group by pay_id having count(pay_id) > 1 ) tpp2 where tpp1.pay_id=tpp2.pay_id Summary: In MySQL data, as of version 5.7.18, IN subqueries should still be used with caution You may also be interested in:
|
<<: WeChat applet + mqtt, esp8266 temperature and humidity reading implementation method
>>: Detailed steps for yum configuration of nginx reverse proxy
Preface In a previous project, the CASE WHEN sort...
This article uses examples to illustrate the prin...
Table of contents verify: Combined with the examp...
Previously, we knew several attributes of backgro...
This article uses an example to describe how to r...
illustrate When you install the system yourself, ...
Migration is unavoidable in many cases. Hardware ...
Table of contents JavaScript prototype chain Obje...
Table of contents 1 element offset series 1.1 Off...
Sometimes we want to execute a command in a conta...
<br />Every family has its own problems, and...
The data backup operation is very easy. Execute t...
Table of contents Docker images What is a mirror?...
Using CI to build docker images for release has g...
If you have a choice, you should use UTF-8 In fac...