Implementation and optimization of MySql subquery IN

Implementation and optimization of MySql subquery IN

Why is IN slow?

After using subqueries in the application, the query performance of the SQL statement becomes very bad. For example:

SELECT driver_id FROM driver where driver_id in (SELECT driver_id FROM driver where _create_date > '2016-07-25 00:00:00');

The independent subquery returns the driver_id that meets the conditions. This problem is solved, but it takes 6 seconds. You can view the execution plan of the SQL statement through EXPLAIN:

You can see that the above SQL statement becomes a correlated subquery. Through the EXPLAIN EXTENDED and SHOW WARNINGS commands, you can see the following results:

Copy the code as follows:
select `northwind`.`driver`.`driver_id` AS `driver_id` from `northwind`.`driver` where <in_optimizer>(`northwind`.`driver`.`driver_id`,<exists>(select 1 from `northwind`.`driver` where ((`northwind`.`driver`.`_create_date` > '2016-07-25 00:00:00') and (<cache>(`northwind`.`driver`.`driver_id`) = `northwind`.`driver`.`driver_id`))))

It can be seen that the MySql optimizer directly converts the IN clause into a correlated subquery of EXISTS. Here is a correlated IN subquery:

SELECT driver_id FROM driver where driver_id in (SELECT driver_id FROM user where user.uid = driver.driver_id);

View the execution plan of the SQL statement:

This is a correlated subquery. Through the EXPLAIN EXTENDED and SHOW WARNINGS commands, we can see the following results:

Copy the code as follows:
select `northwind`.`driver`.`driver_id` AS `driver_id` from `northwind`.`driver` where <in_optimizer>(`northwind`.`driver`.`driver_id`,<exists>(select 1 from `northwind`.`user` where ((`northwind`.`user`.`uid` = `northwind`.`driver`.`driver_id`) and (<cache>(`northwind`.`driver`.`driver_id`) = `northwind`.`driver`.`driver_id`))))

It can be seen that the optimizer before MySQL 5.5 converts IN into EXISTS statements regardless of whether it is an independent subquery or a correlated subquery. If the subquery and outer query return M and N rows respectively, then the subquery is scanned in O(N+N*M) instead of O(N+M). This is why IN is slow.

Which is faster, IN or EXISTS?

Baidu has found many articles online that say it is wrong to say that IN and EXISTS are equally efficient.

If the two tables being queried are of similar size, there is little difference between using in and exists.
If one of the two tables is smaller and the other is larger, use exists for the larger subquery table and in for the smaller subquery table:
For example: Table A (small table), Table B (large table)
1:
select * from A where cc in (select cc from B) is inefficient because it uses the index of the cc column in table A.
select * from A where exists(select cc from B where cc=A.cc) is efficient because it uses the index of the cc column in table B.

Contrary

2:
select * from B where cc in (select cc from A) is efficient because it uses the index of the cc column in table B.
select * from B where exists(select cc from A where cc=B.cc) is inefficient because it uses the index of the cc column in table A.

To summarize the above description, I personally think that the main reason lies in the use of indexes. In any case, as long as the index of a large table is used, efficiency can be improved.

However, when editing this article, I tested it many times but failed to get the results summarized above. The following is the test SQL statement. First, the outer table is a large table and the inner table is a small table. (Example 1)

SELECT count(driver_id) FROM driver where driver_id in (SELECT uid FROM user);
SELECT count(driver_id) FROM driver where exists (SELECT 1 FROM user where uid = driver.driver_id);

The execution result is:

Then the outer surface is a small table, and the inner surface is a big table. (Example 2)

select count(uid) from user where uid in (SELECT driver_id FROM driver);
select count(uid) from user where exists (SELECT 1 FROM driver where driver.driver_id = user.uid);

The execution result is:

It can be found that the execution efficiency of IN and EXISTS is exactly the same in any case. Based on this, we continue to look at the execution plans of the first and second SQL statements in Example 1, as follows:

It can be seen that the execution plans of IN and EXISTS are the same, and the conclusion drawn from this is that the execution efficiency of the two should be the same.

"MySql Technology Insider: SQL Programming": The book describes that many DBAs do think that EXISTS is more efficient than IN. It may be that the optimizer was not very stable or good enough at the time. However, in most cases, IN and EXISTS have the same execution plan.

How to improve efficiency?

The SQL statement in Example 2 above takes about 8 seconds to execute. The slow query is caused by the existence of M*N, but it can still be optimized. Note that the reason for the slowness is that each time the internal query is compared with the external query, it is necessary to traverse the table once. Another method can be used to nest a layer of subqueries to avoid multiple traversal operations. The statement is as follows:

SELECT count(driver_id) FROM driver where exists (SELECT uid FROM (SELECT uid from user) as b where b.uid = driver.driver_id);

The execution effect is as shown below:

It can be found that the optimization reduces the execution time by more than 6 seconds. The following is the SQL execution plan:

It is still a correlated subquery, but the internal traversal query operations are reduced. Therefore, pre-query can be used to reduce traversal operations and improve efficiency.

In fact, in actual programming, many developers choose not to use join table queries, but to first retrieve the data from one table and then perform a WHEREIN operation in another table. The principle is the same as that implemented in the above SQL statement.

MySQL 5.6 optimizes subqueries?

SEMI JOIN strategy

The optimizer recognizes that the IN statement requires a subquery to return one instance of each region key from the region table. This causes MySQL to execute the SELECT statement in a semi-joined manner, so there will only be one instance of each region in the global table that matches the record.

There are two very important differences between semi-joins and regular joins:

  • In a semi-join, the inner table will not result in duplicate results.
  • This operation will not add the fields in the internal table to the result.

Therefore, the result of a semi-join is often a subset of the records from the outer table. From the perspective of effectiveness, the optimization of semi-join is to effectively eliminate duplicate items from the inner table. MySQL applies four different semi-join execution strategies to eliminate duplicate items.

Table Pullout Optimization

Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query. Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout extracts a table from a subquery for the outer query.

Sometimes, a subquery can be rewritten as a JOIN, for example:

SELECT OrderID FROM Orders where EmployeeID IN (select EmployeeID from Employees where EmployeeID > 3);

If it is known that OrderID is unique, that is, a primary key or a unique index, the SQL statement will be rewritten as a Join.

SELECT OrderID FROM Orders join Employees where Orders.EmployeeID = Employees.EmployeeID and Employees.EmployeeID > 3;

The purpose of table pullout is to rewrite the subquery into a JOIN statement based on the unique index. In MySQL 5.5, the execution plan of the above SQL statement is:

If you use the EXPLAIN EXTENDED and SHOW WARNINGS commands, you can see the following results:

Copy the code as follows:
select `northwind`.`Orders`.`OrderID` AS `OrderID` from `northwind`.`Orders` where <in_optimizer>(`northwind`.`Orders`.`EmployeeID`,<exists>(<primary_index_lookup>(<cache>(`northwind`.`Orders`.`EmployeeID`) in Employees on PRIMARY where ((`northwind`.`Employees`.`EmployeeID` > 3) and (<cache>(`northwind`.`Orders`.`EmployeeID`) = `northwind`.`Employees`.`EmployeeID`)))))

It is exactly why in is slow as mentioned above?

In MySQL 5.6, the optimizer will rewrite the SQL statement and obtain the following execution plan:

In MySQL 5.6, the optimizer does not rewrite independent subqueries into correlated subqueries. The optimizer's execution mode is obtained through the EXPLAIN EXTENDED and SHOW WARNINGS commands:

Copy the code as follows:
/* select #1 */ select `northwind`.`orders`.`OrderID` AS `OrderID` from `northwind`.`employees` join `northwind`.`orders` where ((`northwind`.`orders`.`EmployeeID` = `northwind`.`employees`.`EmployeeID`) and (`northwind`.`employees`.`EmployeeID` > 3))

Obviously, the optimizer rewrites the above subquery into a JOIN statement, which is the Table Pullout optimization.

Duplicate Weedout Optimization

Run the semi-join as if it was a join and remove duplicate records using a temporary table.

The columns found in the inner table above are unique, so the optimizer will rewrite the subquery into a JOIN statement to improve the efficiency of SQL execution. Duplicate Weedout optimization means that the columns in the external query condition are unique, and the MySql optimizer will first deduplicate the results found by the subquery. For example, the following SQL statement:

SELECT ContactName FROM Customers where CustomerID in (select CustomerID from Orders where OrderID > 10000 and Customers.Country = Orders.ShipCountry);

Because CustomerID is the primary key, the results obtained from the subquery should be deduplicated. Execution plan in MySql 5.6:

The "Start temporary" in the "Extra" option means to create a temporary table for deduplication, and "End temporary" means to delete the temporary table. Through the EXPLAIN EXTENDED and SHOW WARNINGS commands, the optimizer's execution mode is obtained as follows:

Copy the code as follows:
/* select #1 */ select `northwind`.`customers`.`ContactName` AS `ContactName` from `northwind`.`customers` semi join (`northwind`.`orders`) where ((`northwind`.`customers`.`CustomerID` = `northwind`.`orders`.`CustomerID`) and (`northwind`.`customers`.`Country` = `northwind`.`orders`.`ShipCountry`) and (`northwind`.`orders`.`OrderID` > 10000))

Different from the Table Pullout optimization, semi join is displayed instead of join. The reason is that there is more deduplication work. For the above execution plan, the scanning cost is about 830+830*1=1660 times.
The execution plan in MySql 5.5 is:

As you can see, in MySql 5.5, the statement is still converted into a correlated subquery, and the scanning cost is about 93+93*9=930 times.

We can see that after optimization, the scanning cost of MySql 5.6 is higher than that of 5.5. In fact, this is only the result when the two tables are small. If the table is very large, the effect of optimization will be very obvious.

Materialization Optimization

Materialize the subquery into a temporary table with an index and use the temporary table to perform a join. The index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned.

The above subquery is a correlated subquery. If the subquery is an independent subquery, the optimizer can choose to fill the results of the independent subquery into a single materialized temporary table, as shown in the figure:

According to the order of JOIN, Materialization optimization can be divided into:

  • Materialization scan: JOIN is to associate the materialized temporary table with the table.
  • Materialization lookup: JOIN is to associate the table with the materialized temporary table.

The following subquery can be optimized using Materialization:

SELECT OrderID FROM Orders where OrderID in (select OrderID from `Order Details` where UnitPrice < 50 );

Execution plan of the SQL statement:

It can be seen that when performing JOIN (that is, the step with id 1), the first table scanned is Orders, and then subquery2, so this is an optimization of Materialization lookup. For the following SQL:

select * FROM driver where driver_id in (select uid from user);

Execution plan of the SQL statement:

Subquery2 is scanned first, and then the driver table. This is the optimization of Materialization scan.

FirstMacth Optimization

When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows. This provides an early exit mechanism for table scans and eliminates the generation of unnecessary records.

The way the semi-join's FirstMatch strategy executes subqueries is very similar to IN-TO-EXISTS in earlier MySQL versions. For each matching record in the outer table, MySQL checks for a match in the inner table. When a match is found, it returns the record from the foreign table. Only if no match is found will the engine fall back to scan the entire internal table.

LooseScan Optimization

Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.

SEMI JOIN variables

Each of these strategies except Duplicate Weedout can be enabled or disabled using the optimizer_switch system variable. The semijoin flag controls whether semi-joins are used. If it is set to on, the firstmatch, loosescan, and materialization flags enable finer control over the permitted semi-join strategies. These flags are on by default. Each strategy except Duplicate Weedout can be enabled or disabled using the optimizer_switch system variable. The semijoin flag controls whether semi-joins optimization is enabled. If it is set to on, other strategies also have independent variable control. All variables are enabled by default in 5.6.

mysql> SELECT @@optimizer_switch\G;
*************************** 1. row ***************************
@@optimizer_switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on
1 row in set (0.00 sec)

EXPLAIN view strategy

  • Semi-joined tables show up in the outer select. EXPLAIN EXTENDED plus SHOW WARNINGS shows the rewritten query, which displays the semi-join structure. From this you can get an idea about which tables were pulled out of the semi-join. If a subquery was converted to a semi-join, you will see that the subquery predicate is gone and its tables and WHERE clause were merged into the outer query join list and WHERE clause.
  • Temporary table use for Duplicate Weedout is indicated by Start temporary and End temporary in the Extra column. Tables that were not pulled out and are in the range of EXPLAIN output rows covered by Start temporary and End temporary will have their rowid in the temporary table.
  • FirstMatch(tbl_name) in the Extra column(列) indicates join shortcutting.
  • LooseScan(m..n) in the Extra column indicates use of the LooseScan strategy. m and n are key part numbers.
  • As of MySQL 5.6.7, temporary table use for materialization is indicated by rows with a select_type value of MATERIALIZED and rows with a table value of .
  • Before MySQL 5.6.7, temporary table use for materialization was indicated in the Extra column by Materialize if a single table is used, or by Start materialize and End materialize if multiple tables are used. If Scan is present, no temporary table index is used for table reads. Otherwise, an index lookup is used.

There are no good examples to show the specific effects of FirstMacth optimization and LooseScan optimization in the above introduction. There are opportunities to communicate and learn.

refer to

"MySql Technology Insider: SQL Programming"

http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html

http://tech.it168.com/a2013/0506/1479/000001479749.shtml

This is the end of this article about the execution and optimization of MySql subquery IN. For more related MySql subquery IN content, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • Problems with join queries and subqueries in MySQL
  • Solution to the problem that order by is not effective in MySQL subquery
  • Basic use of subqueries in MySQL
  • Why MySQL does not recommend using subqueries and joins
  • How to create an index on a join table in MySQL
  • mysql subquery and join table details

<<:  A very detailed explanation of the Linux DHCP service

>>:  Summary of new usage examples of computed in Vue3

Recommend

MySQL 5.7 zip archive version installation tutorial

This article shares the installation tutorial of ...

Complete steps to build a Laravel development environment using Docker

Preface In this article, we will use Docker to bu...

The difference between html block-level tags and inline tags

1. Block-level element: refers to the ability to e...

Tutorial on using portainer to connect to remote docker

Portainer is a lightweight docker environment man...

How to set remote access permissions in MySQL 8.0

The previous article explained how to reset the M...

Vue simple implementation of turntable lottery

This article shares the specific code of Vue to s...

Tips for writing concise React components

Table of contents Avoid using the spread operator...

my.cnf parameter configuration to optimize InnoDB engine performance

I have read countless my.cnf configurations on th...

Linux nohup to run programs in the background and view them (nohup and &)

1. Background execution Generally, programs on Li...

A simple way to restart QT application in embedded Linux (based on QT4.8 qws)

Application software generally has such business ...

React Native environment installation process

react-native installation process 1.npx react-nat...