Implementation of MySQL select in subquery optimization

Implementation of MySQL select in subquery optimization

The following demonstration is based on MySQL version 5.7.27

1. Introduction to MySQL subquery optimization strategy:

Subquery optimization strategy

The optimizer chooses different strategies for different types of subqueries.

1. For IN and =ANY subqueries, the optimizer has the following strategy choices:

  • semijoin
  • Materialization
  • exists

2. For NOT IN and <> ALL subqueries, the optimizer has the following strategy choices:

  • Materialization
  • exists

3. For derived tables, the optimizer has the following strategy options:
derived_merge, merges a derived table into an outer query (introduced in 5.7);
Materialize the derived table into an internal temporary table and then use it in the external query.
Note: Subqueries in update and delete statements cannot use semijoin or materialization optimization strategies.

2. Create data for simulation demonstration

In order to facilitate the analysis of the problem, create two tables and insert simulated data:

CREATE TABLE `test02` (
 `id` int(11) NOT NULL,
 `a` int(11) DEFAULT NULL,
 `b` int(11) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `a` (`a`)
)ENGINE=InnoDB;

drop procedure idata;
delimiter;;
create procedure idata()
begin
 declare i int;
 set i=1;
 while(i<=10000)do
  insert into test02 values(i, i, i);
  set i=i+1;
 end while;
end;;
delimiter ;
call idata();

create table test01 like test02;
insert into test01 (select * from test02 where id<=1000)

3. Example Analysis of SQL Instances

Subquery example:

SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10)

Most people would simply think that this SQL will be executed like this:

SELECT test02.b FROM test02 WHERE id < 10

Results: 1, 2, 3, 4, 5, 6, 7, 8, 9

SELECT * FROM test01 WHERE test01.a IN (1,2,3,4,5,6,7,8,9);

But that's not actually how MySQL works. MySQL will push the related outer table into the subquery, and the optimizer believes that this is more efficient. In other words, the optimizer will rewrite the above SQL into this:

select * from test01 where exists(select b from test02 where id < 10 and test01.a=test02.b);

Tip: For MySQL 5.5 and earlier versions

Check the execution plan as follows and find that this SQL performs a full table scan 1000 times on table test01, which is inefficient:

root@localhost [dbtest01]>desc select * from test01 where exists(select b from test02 where id < 10 and test01.a=test02.b);
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
2 rows in set, 2 warnings (0.00 sec)

But when we actually execute the following SQL, we find that it is not slow at all. Isn't this a contradiction? Don't worry, let's continue to analyze it:

SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10)

Check the execution plan of this SQL as follows:

root@localhost [dbtest01]>desc SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10);
+----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where |
| 1 | SIMPLE | test01 | NULL | ref | a | a | 5 | <subquery2>.b | 1 | 100.00 | NULL |
| 2 | MATERIALIZED | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where |
+----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

It is found that the optimizer uses the MATERIALIZED strategy. So I searched for information and studied this strategy.
https://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html

The reason is that since MySQL 5.6, including MySQL 5.6, the optimizer has introduced new optimization strategies: materialization=[off|on], semijoin=[off|on], (off means turning off this strategy, on means turning on this strategy)
You can use show variables like 'optimizer_switch'; to view the optimizer strategy used by MySQL. Of course, these strategies can be modified dynamically online.
set global optimizer_switch='materialization=on,semijoin=on'; represents turning on the optimization strategies materialization and semijoin

The default optimizer strategy for MySQL 5.7.27 is:

root@localhost [dbtest01]>show variables like 'optimizer_switch';                                                               
+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| optimizer_switch | index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on |
+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So in MySQL 5.6 and above

Executing the following SQL will not be slow. Because MySQL optimizer strategies materialization and semijoin optimize this SQL

SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10)

However, we turned off the MySQL optimizer strategies materialization and semijoin for testing, and found that SQL did scan the entire table of test01 (1000):

set global optimizer_switch='materialization=off,semijoin=off';

The execution plan is as follows. The test01 table is indeed fully scanned:

root@localhost [dbtest01]>desc SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10);
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

Let's analyze this execution plan:

! ! ! ! Tip again: If you use MySQL 5.5 or earlier versions, or MySQL 5.6 or later versions and turn off the optimizer strategy materialization=off, semijoin=off, the SQL execution plan you get is the same as the following

root@localhost [dbtest01]>desc select * from test01 where exists(select b from test02 where id < 10 and test01.a=test02.b);
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | PRIMARY | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
2 rows in set, 2 warnings (0.00 sec)

The uncorrelated subquery becomes a correlated subquery (select_type:DEPENDENT SUBQUERY). The subquery needs to be associated with the outer table test01 based on b. Because the test01 field of the outer table is required, the subquery cannot be executed first. The execution process is:

  1. Scan test01 and extract a row of data R from test01;
  2. From the data row R, take out the field a and execute the subquery. If the result is TRUE, put this row of data R into the result set.
  3. Repeat 1 and 2 until the end.

The total number of scanned rows is 1000+1000*9=10000 (this is a theoretical value, but the actual value is less than 10000. I have never figured out how it came about. The rule is that for every additional row in the subquery result set, the total number of scanned rows will be reduced by a few rows).

Semi-join optimizer:

This will cause a problem. If the outer table is a very large table, the subquery must be executed once for each row of the outer query, and the performance of this query will be very poor. It is easy to think of rewriting it into join to improve efficiency:

select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10;

# View the execution plan of this SQL:

desc select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10;

root@localhost [dbtest01]>EXPLAIN extended select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10;
+----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+
| 1 | SIMPLE | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where |
| 1 | SIMPLE | test01 | NULL | ref | a | a | 5 | dbtest01.test02.b | 1 | 100.00 | NULL |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+
2 rows in set, 2 warnings (0.00 sec)

This optimization allows the t2 table to be the driving table, and the associated fields of the t1 table have indexes, which makes the search efficiency very high.

But there is a problem here. Join may get duplicate results, while in (select ...) subquery semantics will not get duplicate values.
Semijoin is a special join that solves the problem of duplicate values.
In a subquery, the optimizer can recognize that only one value needs to be returned for each group in the in clause. In this case, semijoin can be used to optimize the subquery and improve query efficiency.
This is a new feature added in MySQL 5.6. Before MySQL 5.6, the optimizer only had the exists strategy to "optimize" subqueries.

The SQL and execution plan after semijoin optimization are as follows:

root@localhost [dbtest01]>desc SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10);
+----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where |
| 1 | SIMPLE | test01 | NULL | ref | a | a | 5 | <subquery2>.b | 1 | 100.00 | NULL |
| 2 | MATERIALIZED | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where |
+----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)
select 
  `test01`.`id`,`test01`.`a`,`test01`.`b` 
from `test01` semi join `test02` 
where
  ((`test01`.`a` = `<subquery2>`.`b`) 
  and (`test02`.`id` < 10)); 

##Note that this is the SQL rewritten by the optimizer. The semi join syntax cannot be used on the client.

The optimization implementation of semijoin is relatively complex, and it is divided into strategies such as FirstMatch and Materialize. In the above execution plan, select_type=MATERIALIZED means that the semijoin is implemented using the Materialize strategy.
Here the execution flow of semijoin after optimization is:

First execute the subquery and save the result to a temporary table. This temporary table has a primary key for deduplication.
Take a row of data R from the temporary table;
From the data row R, take out the field b and search it in the driven table t1. If it meets the conditions, put it into the result set.
Repeat steps 2 and 3 until the end.
In this way, the subquery result has 9 rows, that is, the temporary table also has 9 rows (there are no duplicate values ​​here), and the total number of scanned rows is 9+9+9*1=27 rows, which is much less than the original 10,000 rows.

Another optimization feature added in MySQL 5.6 is materialization, which materializes the subquery results into a temporary table and then substitutes them into the outer query for search to speed up the query execution. The in-memory temporary table contains the primary key (hash index), eliminates duplicate rows, and makes the table smaller.
If the subquery result is too large and exceeds the tmp_table_size, it will degenerate into a disk temporary table. This way the subquery only needs to be executed once, rather than for each row of the outer query.
However, it should be noted that such external queries still cannot quickly find eligible data through indexes and can only be performed through full table scans or full index scans.

The opening of semijoin and materialization is controlled by the semijoin={on|off}, materialization={on|off} flags in the optimizer_switch parameter.
The different execution plans mentioned above are generated by turning semijoin and materialization on and off. Generally speaking, for subqueries, first check whether the conditions of various optimization strategies are met (for example, if there is a union in the subquery, semijoin optimization cannot be used)
The optimizer will then make a choice based on cost. If there is no other choice, the exists strategy will be used to "optimize" the subquery. There is no parameter to enable or disable the exists strategy.

Here is an example of a delete-related subquery:

Fill the above two test tables with 3.5 million data and 500,000 data respectively to test the delete statement

root@localhost [dbtest01]>select count(*) from test02;
+----------+
| count(*) |
+----------+
|3532986|
+----------+
1 row in set (0.64 sec)
root@localhost [dbtest01]>create table test01 like test02;
Query OK, 0 rows affected (0.01 sec)

root@localhost [dbtest01]>insert into test01 (select * from test02 where id<=500000)

root@localhost [dbtest01]>select count(*) from test01;
+----------+
| count(*) |
+----------+
| 500000 |

The delete statement took 4 seconds to execute.

root@localhost [dbtest01]>delete FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10);
Query OK, 9 rows affected (4.86 sec)

Looking at the execution plan, we can see that the test01 table is almost completely scanned:

root@localhost [dbtest01]>desc delete FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10);
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| 1 | DELETE | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 499343 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where |
+----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
2 rows in set (0.00 sec)

So modify the above delete SQL statement to a pseudo join statement

root@localhost [dbtest01]>desc delete test01.* from test01 join test02 on test01.a=test02.b and test02.id<10;
+----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+
| 1 | SIMPLE | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where |
| 1 | DELETE | test01 | NULL | ref | a | a | 5 | dbtest01.test02.b | 1 | 100.00 | NULL |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+
2 rows in set (0.01 sec)

Execution is very fast root@localhost [dbtest01]>delete test01.* from test01 join test02 on test01.a=test02.b and test02.id<10;
Query OK, 9 rows affected (0.01 sec)

root@localhost [dbtest01]>select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10;
Empty set (0.00 sec)

The following table execution requires a full table scan, which is very slow. It basically performs a full table scan on table test01:

root@lcalhost [dbtest01]>desc delete FROM test01 WHERE id IN (SELECT id FROM test02 WHERE id='350000');
+----+--------------------+--------+------------+-------+---------------+---------+---------+-------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+---------------+---------+---------+-------+--------+----------+-------------+
| 1 | DELETE | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 499343 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | test02 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
+----+--------------------+--------+------------+-------+---------------+---------+---------+-------+--------+----------+-------------+
2 rows in set (0.00 sec)

However, if join is used, the efficiency is very high:

root@localhost [dbtest01]>desc delete test01.* FROM test01 inner join test02 WHERE test01.id=test02.id and test02.id=350000;
+----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+
| 1 | DELETE | test01 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |
| 1 | SIMPLE | test02 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+
2 rows in set (0.01 sec)

 
root@localhost [dbtest01]> desc delete test01.* from test01 join test02 on test01.a=test02.b and test02.id=350000;
+----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+------+------+------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+------+------+------+
| 1 | SIMPLE | test02 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |
| 1 | DELETE | test01 | NULL | ref | a | a | 5 | const | 1 | 100.00 | NULL |
+----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+------+------+------+
2 rows in set (0.00 sec)

Reference Documents:

https://www.cnblogs.com/zhengyun_ustc/p/slowquery1.html
https://www.jianshu.com/p/3989222f7084
https://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html

This is the end of this article about the implementation of MySQL select in subquery optimization. For more relevant MySQL select in subquery optimization content, please search 123WORDPRESS.COM's previous articles or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • A brief discussion on MySQL select optimization solution
  • MySQL select results to perform update example tutorial
  • Solve the problem that MySQL read-write separation causes data not to be selected after insert
  • How MySQL Select Statement is Executed
  • MySQL learning notes: complete select statement usage example detailed explanation
  • MySQL select, insert, update batch operation statement code examples
  • A brief understanding of MySQL SELECT execution order
  • Explanation of mysql transaction select for update and data consistency processing
  • The difference between Update and select in MySQL for single and multiple tables, and views and temporary tables
  • Detailed explanation of the use of MySQL select cache mechanism
  • Summary of Select usage in MySql database
  • How a select statement is executed in MySQL

<<:  Detailed explanation of Docker data backup and recovery process

>>:  Rendering Function & JSX Details

Recommend

Some notes on modifying the innodb_data_file_path parameter of MySQL

Preface innodb_data_file_path is used to specify ...

Problems with index and FROM_UNIXTIME in mysql

Zero, Background I received a lot of alerts this ...

Detailed explanation of Mysql logical architecture

1. Overall architecture diagram Compared to other...

Full-screen drag upload component based on Vue3

This article mainly introduces the full-screen dr...

Example of ellipsis when CSS multi-line text overflows

Ellipses appear when multi-line text overflows Th...

How to get form data in Vue

Table of contents need Get data and submit Templa...

The DOCTYPE mode selection mechanism of well-known browsers

Document Scope This article covers mode switching...

Some functions of using tcpdump to capture packets in the Linux command line

tcpdump is a flexible and powerful packet capture...

JavaScript canvas to achieve raindrop effect

This article example shares the specific code for...

Problems and solutions of using jsx syntax in React-vscode

Problem Description After installing the plugin E...

The meaning of status code in HTTP protocol

A status code that indicates a provisional respon...

Pure CSS3 code to implement a running clock

Operation effectCode Implementation html <div ...

mysql batch delete large amounts of data

mysql batch delete large amounts of data Assume t...

How to delete table data in MySQL

There are two ways to delete data in MySQL, one i...