Why is IN slow?After using subqueries in the application, the query performance of the SQL statement becomes very bad. For example: SELECT driver_id FROM driver where driver_id in (SELECT driver_id FROM driver where _create_date > '2016-07-25 00:00:00'); The independent subquery returns the driver_id that meets the conditions. This problem is solved, but it takes 6 seconds. You can view the execution plan of the SQL statement through EXPLAIN: You can see that the above SQL statement becomes a correlated subquery. Through the EXPLAIN EXTENDED and SHOW WARNINGS commands, you can see the following results: Copy the code as follows: select `northwind`.`driver`.`driver_id` AS `driver_id` from `northwind`.`driver` where <in_optimizer>(`northwind`.`driver`.`driver_id`,<exists>(select 1 from `northwind`.`driver` where ((`northwind`.`driver`.`_create_date` > '2016-07-25 00:00:00') and (<cache>(`northwind`.`driver`.`driver_id`) = `northwind`.`driver`.`driver_id`)))) It can be seen that the MySql optimizer directly converts the IN clause into a correlated subquery of EXISTS. Here is a correlated IN subquery: SELECT driver_id FROM driver where driver_id in (SELECT driver_id FROM user where user.uid = driver.driver_id); View the execution plan of the SQL statement: This is a correlated subquery. Through the EXPLAIN EXTENDED and SHOW WARNINGS commands, we can see the following results: Copy the code as follows: select `northwind`.`driver`.`driver_id` AS `driver_id` from `northwind`.`driver` where <in_optimizer>(`northwind`.`driver`.`driver_id`,<exists>(select 1 from `northwind`.`user` where ((`northwind`.`user`.`uid` = `northwind`.`driver`.`driver_id`) and (<cache>(`northwind`.`driver`.`driver_id`) = `northwind`.`driver`.`driver_id`)))) It can be seen that the optimizer before MySQL 5.5 converts IN into EXISTS statements regardless of whether it is an independent subquery or a correlated subquery. If the subquery and outer query return M and N rows respectively, then the subquery is scanned in O(N+N*M) instead of O(N+M). This is why IN is slow. Which is faster, IN or EXISTS?Baidu has found many articles online that say it is wrong to say that IN and EXISTS are equally efficient. If the two tables being queried are of similar size, there is little difference between using in and exists. Contrary 2: To summarize the above description, I personally think that the main reason lies in the use of indexes. In any case, as long as the index of a large table is used, efficiency can be improved. However, when editing this article, I tested it many times but failed to get the results summarized above. The following is the test SQL statement. First, the outer table is a large table and the inner table is a small table. (Example 1) SELECT count(driver_id) FROM driver where driver_id in (SELECT uid FROM user); SELECT count(driver_id) FROM driver where exists (SELECT 1 FROM user where uid = driver.driver_id); The execution result is: Then the outer surface is a small table, and the inner surface is a big table. (Example 2) select count(uid) from user where uid in (SELECT driver_id FROM driver); select count(uid) from user where exists (SELECT 1 FROM driver where driver.driver_id = user.uid); The execution result is: It can be found that the execution efficiency of IN and EXISTS is exactly the same in any case. Based on this, we continue to look at the execution plans of the first and second SQL statements in Example 1, as follows: It can be seen that the execution plans of IN and EXISTS are the same, and the conclusion drawn from this is that the execution efficiency of the two should be the same. "MySql Technology Insider: SQL Programming": The book describes that many DBAs do think that EXISTS is more efficient than IN. It may be that the optimizer was not very stable or good enough at the time. However, in most cases, IN and EXISTS have the same execution plan. How to improve efficiency?The SQL statement in Example 2 above takes about 8 seconds to execute. The slow query is caused by the existence of M*N, but it can still be optimized. Note that the reason for the slowness is that each time the internal query is compared with the external query, it is necessary to traverse the table once. Another method can be used to nest a layer of subqueries to avoid multiple traversal operations. The statement is as follows: SELECT count(driver_id) FROM driver where exists (SELECT uid FROM (SELECT uid from user) as b where b.uid = driver.driver_id); The execution effect is as shown below: It can be found that the optimization reduces the execution time by more than 6 seconds. The following is the SQL execution plan: It is still a correlated subquery, but the internal traversal query operations are reduced. Therefore, pre-query can be used to reduce traversal operations and improve efficiency. In fact, in actual programming, many developers choose not to use join table queries, but to first retrieve the data from one table and then perform a WHEREIN operation in another table. The principle is the same as that implemented in the above SQL statement. MySQL 5.6 optimizes subqueries?SEMI JOIN strategyThe optimizer recognizes that the IN statement requires a subquery to return one instance of each region key from the region table. This causes MySQL to execute the SELECT statement in a semi-joined manner, so there will only be one instance of each region in the global table that matches the record. There are two very important differences between semi-joins and regular joins:
Therefore, the result of a semi-join is often a subset of the records from the outer table. From the perspective of effectiveness, the optimization of semi-join is to effectively eliminate duplicate items from the inner table. MySQL applies four different semi-join execution strategies to eliminate duplicate items. Table Pullout Optimization
Sometimes, a subquery can be rewritten as a JOIN, for example: SELECT OrderID FROM Orders where EmployeeID IN (select EmployeeID from Employees where EmployeeID > 3); If it is known that OrderID is unique, that is, a primary key or a unique index, the SQL statement will be rewritten as a Join. SELECT OrderID FROM Orders join Employees where Orders.EmployeeID = Employees.EmployeeID and Employees.EmployeeID > 3; The purpose of table pullout is to rewrite the subquery into a JOIN statement based on the unique index. In MySQL 5.5, the execution plan of the above SQL statement is: If you use the EXPLAIN EXTENDED and SHOW WARNINGS commands, you can see the following results: Copy the code as follows: select `northwind`.`Orders`.`OrderID` AS `OrderID` from `northwind`.`Orders` where <in_optimizer>(`northwind`.`Orders`.`EmployeeID`,<exists>(<primary_index_lookup>(<cache>(`northwind`.`Orders`.`EmployeeID`) in Employees on PRIMARY where ((`northwind`.`Employees`.`EmployeeID` > 3) and (<cache>(`northwind`.`Orders`.`EmployeeID`) = `northwind`.`Employees`.`EmployeeID`))))) It is exactly why in is slow as mentioned above? In MySQL 5.6, the optimizer will rewrite the SQL statement and obtain the following execution plan: In MySQL 5.6, the optimizer does not rewrite independent subqueries into correlated subqueries. The optimizer's execution mode is obtained through the EXPLAIN EXTENDED and SHOW WARNINGS commands: Copy the code as follows: /* select #1 */ select `northwind`.`orders`.`OrderID` AS `OrderID` from `northwind`.`employees` join `northwind`.`orders` where ((`northwind`.`orders`.`EmployeeID` = `northwind`.`employees`.`EmployeeID`) and (`northwind`.`employees`.`EmployeeID` > 3)) Obviously, the optimizer rewrites the above subquery into a JOIN statement, which is the Table Pullout optimization. Duplicate Weedout OptimizationRun the semi-join as if it was a join and remove duplicate records using a temporary table. The columns found in the inner table above are unique, so the optimizer will rewrite the subquery into a JOIN statement to improve the efficiency of SQL execution. Duplicate Weedout optimization means that the columns in the external query condition are unique, and the MySql optimizer will first deduplicate the results found by the subquery. For example, the following SQL statement: SELECT ContactName FROM Customers where CustomerID in (select CustomerID from Orders where OrderID > 10000 and Customers.Country = Orders.ShipCountry); Because CustomerID is the primary key, the results obtained from the subquery should be deduplicated. Execution plan in MySql 5.6: The "Start temporary" in the "Extra" option means to create a temporary table for deduplication, and "End temporary" means to delete the temporary table. Through the EXPLAIN EXTENDED and SHOW WARNINGS commands, the optimizer's execution mode is obtained as follows: Copy the code as follows: /* select #1 */ select `northwind`.`customers`.`ContactName` AS `ContactName` from `northwind`.`customers` semi join (`northwind`.`orders`) where ((`northwind`.`customers`.`CustomerID` = `northwind`.`orders`.`CustomerID`) and (`northwind`.`customers`.`Country` = `northwind`.`orders`.`ShipCountry`) and (`northwind`.`orders`.`OrderID` > 10000)) Different from the Table Pullout optimization, semi join is displayed instead of join. The reason is that there is more deduplication work. For the above execution plan, the scanning cost is about 830+830*1=1660 times. As you can see, in MySql 5.5, the statement is still converted into a correlated subquery, and the scanning cost is about 93+93*9=930 times. We can see that after optimization, the scanning cost of MySql 5.6 is higher than that of 5.5. In fact, this is only the result when the two tables are small. If the table is very large, the effect of optimization will be very obvious. Materialization Optimization
The above subquery is a correlated subquery. If the subquery is an independent subquery, the optimizer can choose to fill the results of the independent subquery into a single materialized temporary table, as shown in the figure: According to the order of JOIN, Materialization optimization can be divided into:
The following subquery can be optimized using Materialization: SELECT OrderID FROM Orders where OrderID in (select OrderID from `Order Details` where UnitPrice < 50 ); Execution plan of the SQL statement: It can be seen that when performing JOIN (that is, the step with id 1), the first table scanned is Orders, and then subquery2, so this is an optimization of Materialization lookup. For the following SQL: select * FROM driver where driver_id in (select uid from user); Execution plan of the SQL statement: Subquery2 is scanned first, and then the driver table. This is the optimization of Materialization scan. FirstMacth OptimizationWhen scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows. This provides an early exit mechanism for table scans and eliminates the generation of unnecessary records. The way the semi-join's FirstMatch strategy executes subqueries is very similar to IN-TO-EXISTS in earlier MySQL versions. For each matching record in the outer table, MySQL checks for a match in the inner table. When a match is found, it returns the record from the foreign table. Only if no match is found will the engine fall back to scan the entire internal table. LooseScan OptimizationScan a subquery table using an index that enables a single value to be chosen from each subquery's value group. SEMI JOIN variablesEach of these strategies except Duplicate Weedout can be enabled or disabled using the optimizer_switch system variable. The semijoin flag controls whether semi-joins are used. If it is set to on, the firstmatch, loosescan, and materialization flags enable finer control over the permitted semi-join strategies. These flags are on by default. Each strategy except Duplicate Weedout can be enabled or disabled using the optimizer_switch system variable. The semijoin flag controls whether semi-joins optimization is enabled. If it is set to on, other strategies also have independent variable control. All variables are enabled by default in 5.6. mysql> SELECT @@optimizer_switch\G; *************************** 1. row *************************** @@optimizer_switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on 1 row in set (0.00 sec) EXPLAIN view strategy
There are no good examples to show the specific effects of FirstMacth optimization and LooseScan optimization in the above introduction. There are opportunities to communicate and learn. refer to"MySql Technology Insider: SQL Programming" http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html http://tech.it168.com/a2013/0506/1479/000001479749.shtml This is the end of this article about the execution and optimization of MySql subquery IN. For more related MySql subquery IN content, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future! You may also be interested in:
|
<<: A very detailed explanation of the Linux DHCP service
>>: Summary of new usage examples of computed in Vue3
This article shares the installation tutorial of ...
Preface In this article, we will use Docker to bu...
1. Block-level element: refers to the ability to e...
Portainer is a lightweight docker environment man...
The previous article explained how to reset the M...
I am going to review Java these two days, so I wr...
1. Download from official website: https://dev.my...
This article shares the specific code of Vue to s...
Table of contents Avoid using the spread operator...
I have read countless my.cnf configurations on th...
Table of contents Overview 1. Develop the require...
1. Background execution Generally, programs on Li...
Application software generally has such business ...
Recently I wrote in my blog that in the project l...
react-native installation process 1.npx react-nat...