Solve the problem that IN subquery in MySQL will cause the index to be unusable

Today I saw a case study on MySQL IN subquery optimization.

I was a little skeptical at first (this would be impossible in SQL Server, but I’ll do a simple test later.)

Then I made a table to test and verify it according to what he said, and found that the IN subquery of MySQL was not done well, which would indeed lead to the situation where the index could not be used (the IN subquery could not be used, so the scenario was MySQL, and the ending version was 5.7.18)

MySQL test environment

The test table is as follows

create table test_table2
(
  id int auto_increment primary key,
  pay_id int,
  pay_time datetime,
  other_col varchar(100)
)

Create a stored procedure to insert test data. The characteristic of the test data is that pay_id is repeatable. Here, in the stored procedure, a repeated pay_id is inserted every 100 data items during the loop insertion of 3 million data items. The time field is random within a certain range.

CREATE DEFINER=`root`@`%` PROCEDURE `test_insert`(IN `loopcount` INT)
  LANGUAGE SQL
  NOT DETERMINISTIC
  CONTAINS SQL
  SQL SECURITY DEFINER
  COMMENT ''
BEGIN
  declare cnt int;
  set cnt = 0;
  while cnt< loopcount do
    insert into test_table2 (pay_id,pay_time,other_col) values (cnt,date_add(now(), interval floor(300*rand()) day),uuid());
    if (cnt mod 100 = 0) then
      insert into test_table2 (pay_id,pay_time,other_col) values (cnt,date_add(now(), interval floor(300*rand()) day),uuid());
    end if;
    set cnt = cnt + 1;  
  end while;
END

Execute call test_insert(3000000); insert 303000 rows of data

Two ways to write subqueries

The query roughly means to query the data with business ID greater than 1 within a certain time period, so there are two ways of writing it.

The first way of writing is as follows: the IN subquery contains the business ID whose business statistics are greater than 1 within a certain period of time. The outer layer queries according to the result of the IN subquery. The pay_id column of the business ID has an index, and the logic is relatively simple. This way of writing is indeed inefficient when the amount of data is large, and no index is needed.

select * from test_table2 force index(idx_pay_id)
where pay_id in (
  select pay_id from test_table2 
  where pay_time>="2016-06-01 00:00:00" 
    AND pay_time<="2017-07-03 12:59:59" 
  group by pay_id 
  having count(pay_id) > 1
);

Execution result: 2.23 seconds

The second way of writing is to join with the subquery. This way of writing is equivalent to the above IN subquery writing. The following test found that the efficiency is indeed improved a lot.

select tpp1.* from test_table2 tpp1, 
(
   select pay_id 
   from test_table2 
   WHERE pay_time>="2016-07-01 00:00:00" 
   AND pay_time<="2017-07-03 12:59:59" 
   group by pay_id 
   having count(pay_id) > 1
) tpp2 
where tpp1.pay_id=tpp2.pay_id

Execution result: 0.48 seconds

In the execution plan of the subquery, it is found that the outer query is a full table scan, and the index on pay_id is not used.

In the execution plan of the join self-check, the outer layer (query of the tpp1 alias) uses the index on pay_id.

Later, I wanted to use a forced index for the first query method. Although there was no error, I found that it was useless.

If the subquery is a direct value, the index can be used normally.

It can be seen that MySQL's support for IN subqueries is really not very good.

In addition: adding a case of using a temporary table, although it is more efficient than many join queries, it is also more efficient than directly using the IN subquery. In this case, the index can also be used, but in this simple case, there is no need to use a temporary table.

The following is a similar case test in SQL Server 2014. The test table structure and number are exactly the same. It can be seen that in this case, the two writing methods can be considered to be exactly the same in SQL Server (execution plan + efficiency). In this respect, SQL Server is much better than MySQL.

Below is the test environment script in sqlserver.

create table test_table2
(
  id int identity(1,1) primary key,
  pay_id int,
  pay_time datetime,
  other_col varchar(100)
)
begin tran
declare @i int = 0
while @i<300000
begin
  insert into test_table2 values (@i,getdate()-rand()*300,newid());
  if(@i%1000=0)
  begin
    insert into test_table2 values (@i,getdate()-rand()*300,newid());
  end
  set @i = @i + 1
end
COMMIT
GO
create index idx_pay_id on test_table2(pay_id);
create index idx_time on test_table2(pay_time);
GO
select * from test_table2 
where pay_id in (
          select pay_id from test_table2 
          where pay_time>='2017-01-21 00:00:00' 
          AND pay_time<='2017-07-03 12:59:59' 
          group by pay_id 
          having count(pay_id) > 1
        );
select tpp1.* from test_table2 tpp1, 
(
   select pay_id 
   from test_table2 
   WHERE pay_time>='2017-01-21 00:00:00'
   AND pay_time<='2017-07-30 12:59:59' 
   group by pay_id having 
   count(pay_id) > 1
) tpp2 
where tpp1.pay_id=tpp2.pay_id

Summary: In MySQL data, as of version 5.7.18, IN subqueries should still be used with caution

You may also be interested in:

Subquery examples in MySQL
Detailed description of subquery operations in MySQL
Detailed explanation of MySQL subqueries (nested queries), join tables, and combined queries
Examples of optimization techniques for slow query efficiency in MySQL IN statements
MySQL optimization: use join instead of subquery
Mysql subquery IN using LIMIT application example
MYSQL subquery and nested query optimization example analysis
MySQL implements multi-table association statistics (subquery statistics) example
MySQL Notes: Introduction to Subquery Usage
Solution to the problem that order by is not effective in MySQL subquery

<<: WeChat applet + mqtt, esp8266 temperature and humidity reading implementation method

>>: Detailed steps for yum configuration of nginx reverse proxy