The following demonstration is based on MySQL version 5.7.27 1. Introduction to MySQL subquery optimization strategy: Subquery optimization strategy The optimizer chooses different strategies for different types of subqueries. 1. For IN and =ANY subqueries, the optimizer has the following strategy choices:
2. For NOT IN and <> ALL subqueries, the optimizer has the following strategy choices:
3. For derived tables, the optimizer has the following strategy options: 2. Create data for simulation demonstration In order to facilitate the analysis of the problem, create two tables and insert simulated data: CREATE TABLE `test02` ( `id` int(11) NOT NULL, `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, PRIMARY KEY (`id`), KEY `a` (`a`) )ENGINE=InnoDB; drop procedure idata; delimiter;; create procedure idata() begin declare i int; set i=1; while(i<=10000)do insert into test02 values(i, i, i); set i=i+1; end while; end;; delimiter ; call idata(); create table test01 like test02; insert into test01 (select * from test02 where id<=1000) 3. Example Analysis of SQL Instances Subquery example: SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10) Most people would simply think that this SQL will be executed like this: SELECT test02.b FROM test02 WHERE id < 10 Results: 1, 2, 3, 4, 5, 6, 7, 8, 9 SELECT * FROM test01 WHERE test01.a IN (1,2,3,4,5,6,7,8,9); But that's not actually how MySQL works. MySQL will push the related outer table into the subquery, and the optimizer believes that this is more efficient. In other words, the optimizer will rewrite the above SQL into this: select * from test01 where exists(select b from test02 where id < 10 and test01.a=test02.b); Tip: For MySQL 5.5 and earlier versions Check the execution plan as follows and find that this SQL performs a full table scan 1000 times on table test01, which is inefficient: root@localhost [dbtest01]>desc select * from test01 where exists(select b from test02 where id < 10 and test01.a=test02.b); +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ | 1 | PRIMARY | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where | | 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ 2 rows in set, 2 warnings (0.00 sec) But when we actually execute the following SQL, we find that it is not slow at all. Isn't this a contradiction? Don't worry, let's continue to analyze it: SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10) Check the execution plan of this SQL as follows: root@localhost [dbtest01]>desc SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10); +----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+ | 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where | | 1 | SIMPLE | test01 | NULL | ref | a | a | 5 | <subquery2>.b | 1 | 100.00 | NULL | | 2 | MATERIALIZED | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where | +----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+ 3 rows in set, 1 warning (0.00 sec) It is found that the optimizer uses the MATERIALIZED strategy. So I searched for information and studied this strategy. The reason is that since MySQL 5.6, including MySQL 5.6, the optimizer has introduced new optimization strategies: materialization=[off|on], semijoin=[off|on], (off means turning off this strategy, on means turning on this strategy) The default optimizer strategy for MySQL 5.7.27 is: root@localhost [dbtest01]>show variables like 'optimizer_switch'; +------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Variable_name | Value | +------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | optimizer_switch | index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on | +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- So in MySQL 5.6 and above Executing the following SQL will not be slow. Because MySQL optimizer strategies materialization and semijoin optimize this SQL SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10) However, we turned off the MySQL optimizer strategies materialization and semijoin for testing, and found that SQL did scan the entire table of test01 (1000): set global optimizer_switch='materialization=off,semijoin=off'; The execution plan is as follows. The test01 table is indeed fully scanned: root@localhost [dbtest01]>desc SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10); +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ | 1 | PRIMARY | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where | | 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ 2 rows in set, 1 warning (0.00 sec) Let's analyze this execution plan: ! ! ! ! Tip again: If you use MySQL 5.5 or earlier versions, or MySQL 5.6 or later versions and turn off the optimizer strategy materialization=off, semijoin=off, the SQL execution plan you get is the same as the following root@localhost [dbtest01]>desc select * from test01 where exists(select b from test02 where id < 10 and test01.a=test02.b); +----+--------------------+--------+------------+-------+---------------+---------+---------+------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+------+----------+-------------+ | 1 | PRIMARY | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where | | 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+------+----------+-------------+ 2 rows in set, 2 warnings (0.00 sec) The uncorrelated subquery becomes a correlated subquery (select_type:DEPENDENT SUBQUERY). The subquery needs to be associated with the outer table test01 based on b. Because the test01 field of the outer table is required, the subquery cannot be executed first. The execution process is:
The total number of scanned rows is 1000+1000*9=10000 (this is a theoretical value, but the actual value is less than 10000. I have never figured out how it came about. The rule is that for every additional row in the subquery result set, the total number of scanned rows will be reduced by a few rows). Semi-join optimizer: This will cause a problem. If the outer table is a very large table, the subquery must be executed once for each row of the outer query, and the performance of this query will be very poor. It is easy to think of rewriting it into join to improve efficiency: select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10; # View the execution plan of this SQL: desc select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10; root@localhost [dbtest01]>EXPLAIN extended select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10; +----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+ | 1 | SIMPLE | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where | | 1 | SIMPLE | test01 | NULL | ref | a | a | 5 | dbtest01.test02.b | 1 | 100.00 | NULL | +----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+ 2 rows in set, 2 warnings (0.00 sec) This optimization allows the t2 table to be the driving table, and the associated fields of the t1 table have indexes, which makes the search efficiency very high. But there is a problem here. Join may get duplicate results, while in (select ...) subquery semantics will not get duplicate values. The SQL and execution plan after semijoin optimization are as follows: root@localhost [dbtest01]>desc SELECT * FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10); +----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+ | 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where | | 1 | SIMPLE | test01 | NULL | ref | a | a | 5 | <subquery2>.b | 1 | 100.00 | NULL | | 2 | MATERIALIZED | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where | +----+--------------+-------------+------------+-------+---------------+--------+---------+---------------+------+----------+-------------+ 3 rows in set, 1 warning (0.00 sec) select `test01`.`id`,`test01`.`a`,`test01`.`b` from `test01` semi join `test02` where ((`test01`.`a` = `<subquery2>`.`b`) and (`test02`.`id` < 10)); ##Note that this is the SQL rewritten by the optimizer. The semi join syntax cannot be used on the client. The optimization implementation of semijoin is relatively complex, and it is divided into strategies such as FirstMatch and Materialize. In the above execution plan, select_type=MATERIALIZED means that the semijoin is implemented using the Materialize strategy. First execute the subquery and save the result to a temporary table. This temporary table has a primary key for deduplication. Another optimization feature added in MySQL 5.6 is materialization, which materializes the subquery results into a temporary table and then substitutes them into the outer query for search to speed up the query execution. The in-memory temporary table contains the primary key (hash index), eliminates duplicate rows, and makes the table smaller. The opening of semijoin and materialization is controlled by the semijoin={on|off}, materialization={on|off} flags in the optimizer_switch parameter. Here is an example of a delete-related subquery: Fill the above two test tables with 3.5 million data and 500,000 data respectively to test the delete statement root@localhost [dbtest01]>select count(*) from test02; +----------+ | count(*) | +----------+ |3532986| +----------+ 1 row in set (0.64 sec) root@localhost [dbtest01]>create table test01 like test02; Query OK, 0 rows affected (0.01 sec) root@localhost [dbtest01]>insert into test01 (select * from test02 where id<=500000) root@localhost [dbtest01]>select count(*) from test01; +----------+ | count(*) | +----------+ | 500000 | The delete statement took 4 seconds to execute. root@localhost [dbtest01]>delete FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10); Query OK, 9 rows affected (4.86 sec) Looking at the execution plan, we can see that the test01 table is almost completely scanned: root@localhost [dbtest01]>desc delete FROM test01 WHERE test01.a IN (SELECT test02.b FROM test02 WHERE id < 10); +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ | 1 | DELETE | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 499343 | 100.00 | Using where | | 2 | DEPENDENT SUBQUERY | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 10.00 | Using where | +----+--------------------+--------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+ 2 rows in set (0.00 sec) So modify the above delete SQL statement to a pseudo join statement root@localhost [dbtest01]>desc delete test01.* from test01 join test02 on test01.a=test02.b and test02.id<10; +----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+ | 1 | SIMPLE | test02 | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 9 | 100.00 | Using where | | 1 | DELETE | test01 | NULL | ref | a | a | 5 | dbtest01.test02.b | 1 | 100.00 | NULL | +----+-------------+--------+------------+-------+---------------+--------+---------+-------------------+------+----------+-------------+ 2 rows in set (0.01 sec) Execution is very fast root@localhost [dbtest01]>delete test01.* from test01 join test02 on test01.a=test02.b and test02.id<10; Query OK, 9 rows affected (0.01 sec) root@localhost [dbtest01]>select test01.* from test01 join test02 on test01.a=test02.b and test02.id<10; Empty set (0.00 sec) The following table execution requires a full table scan, which is very slow. It basically performs a full table scan on table test01: root@lcalhost [dbtest01]>desc delete FROM test01 WHERE id IN (SELECT id FROM test02 WHERE id='350000'); +----+--------------------+--------+------------+-------+---------------+---------+---------+-------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+--------+------------+-------+---------------+---------+---------+-------+--------+----------+-------------+ | 1 | DELETE | test01 | NULL | ALL | NULL | NULL | NULL | NULL | 499343 | 100.00 | Using where | | 2 | DEPENDENT SUBQUERY | test02 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index | +----+--------------------+--------+------------+-------+---------------+---------+---------+-------+--------+----------+-------------+ 2 rows in set (0.00 sec) However, if join is used, the efficiency is very high: root@localhost [dbtest01]>desc delete test01.* FROM test01 inner join test02 WHERE test01.id=test02.id and test02.id=350000; +----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+ | 1 | DELETE | test01 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL | | 1 | SIMPLE | test02 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index | +----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+ 2 rows in set (0.01 sec) root@localhost [dbtest01]> desc delete test01.* from test01 join test02 on test01.a=test02.b and test02.id=350000; +----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+------+------+------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+------+------+------+ | 1 | SIMPLE | test02 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL | | 1 | DELETE | test01 | NULL | ref | a | a | 5 | const | 1 | 100.00 | NULL | +----+-------------+--------+------------+-------+---------------+--------+---------+-------+------+------+------+------+ 2 rows in set (0.00 sec) Reference Documents: https://www.cnblogs.com/zhengyun_ustc/p/slowquery1.html This is the end of this article about the implementation of MySQL select in subquery optimization. For more relevant MySQL select in subquery optimization content, please search 123WORDPRESS.COM's previous articles or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future! You may also be interested in:
|
<<: Detailed explanation of Docker data backup and recovery process
>>: Rendering Function & JSX Details
Preface innodb_data_file_path is used to specify ...
Zero, Background I received a lot of alerts this ...
1. Overall architecture diagram Compared to other...
This article mainly introduces the full-screen dr...
Ellipses appear when multi-line text overflows Th...
Table of contents need Get data and submit Templa...
Document Scope This article covers mode switching...
tcpdump is a flexible and powerful packet capture...
The emergence of jQuery has greatly improved our ...
This article example shares the specific code for...
Problem Description After installing the plugin E...
A status code that indicates a provisional respon...
Operation effectCode Implementation html <div ...
mysql batch delete large amounts of data Assume t...
There are two ways to delete data in MySQL, one i...