Background Recently, when writing SQL statements, I was hesitant about whether to choose IN or Exists, so I wrote out the SQL of both methods to compare the execution efficiency. I found that the query efficiency of IN is much higher than that of Exists, so I naturally assumed that the efficiency of IN is better than Exists. However, in line with the principle of getting to the bottom of things, I want to know whether this conclusion applies to all scenarios and why this result occurs. Experimental data My experimental data includes two tables: t_author table and t_poetry table. t_author table, 13355 records; The corresponding table structure is as follows:
Execution plan analysis IN execution process SQL example: Its execution plan: Exists execution process SQL example: Its execution plan: (1) First retrieve all records from the tabA table. Experimental procedures The experiment analyzes the IN and Exists SQL statements for the same result set.
SQL statement containing Exists:
Data from the first experiment t_author table, 13355 records; t_poetry table, subquery filter result set where poetry_id>293650, 121 records; Execution Results Using exists takes 0.94 seconds, and using in takes 0.03 seconds. IN is more efficient than Exists. Cause Analysis The result set of the subquery on the t_poetry table is very small, and both can use indexes on the t_poetry table, and the consumption of the t_poetry subquery is basically the same. The difference between the two is that when using in, the t_author table can use the index: When using exists, the t_author table is fully scanned: When the subquery result set is small, the query time is mainly reflected in the traversal of the t_author table. Second experimental data t_author table, 13,355 records; t_poetry table, subquery filter result set where poetry_id>3650, 287,838 records; Execution time Using exists takes 0.12 seconds, and using in takes 0.48 seconds. Exists is more efficient than IN . Cause Analysis The index usage of the two experiments is consistent with that of the first experiment. The only difference is the size of the subquery filtering result set. However, the experimental results are different from the first one. In this case, the subquery result set is very large. Let's look at the MySQL query plan: When using exists, the change in data volume does not change the execution plan. However, due to the large subquery result set, MySQL versions after 5.5 use Block Nested-Loop (Block nested loop, introducing join buffer, similar to cache function) when matching the query results of exists, which begins to have a significant impact on query efficiency, especially when the subquery result set is large, it can significantly improve the query matching efficiency: Experimental Conclusion Based on the above two experiments and their results, we can clearly understand the execution process of IN and Exists, and summarize the applicable scenarios of IN and Exists: IN queries can use indexes on both inner and outer tables; Exists queries can use indexes only on inner tables. When the subquery result set is large and the outer table is small, the Block Nested Loop of Exists begins to play a role and makes up for the defect that the outer table cannot use indexes, and the query efficiency will be better than IN. When the subquery result set is small and the external table is large, the optimization effect of the Block nested loop of Exists is not obvious, and the advantage of the external table index of IN plays a major role. In this case, the query efficiency of IN is better than that of Exists. What is said online is inaccurate. In fact, the "size of the table" does not refer to the internal table and the external table, but the external table and the subquery result set . The last point is also the most important one: there is no absolute truth in the world. Grasping the essence of things and conducting practical verification for different scenarios is the most reliable and effective method. Supplementary information on problems found during the experiment When analyzing the above exists statements under different data sets, I found that the larger the data set, the shorter the time consumed, which is very strange.
Possible reasons: The larger the condition value is, the later the query is, the more records need to be traversed, resulting in more time consumption. This explanation needs to be further verified. You may also be interested in:
|
<<: How to view and terminate running background programs in Linux
>>: Detailed explanation of Vue custom instructions and their use
Directory Structure . │ .env │ docker-compose.yml...
Preface The <router-link> tag is a great to...
1. Uninstalling MySQL 5.7 1.1查看yum是否安裝過mysql cd y...
Error message: Store update, insert, or delete st...
<br />The solution steps are as follows: Sta...
The reason is simple: In HTML documents, multiple ...
Composition inheritance Combination inheritance i...
Table of contents Preface 1. What are Mixins? 2. ...
I have been in contact with MGR for some time. Wi...
There is a medicine for regret in the world, as l...
Preface Every time I use the terminal to create a...
Preface Generally speaking, when we talk about Li...
It is very simple to install Scala environment in...
1. CSS3 triangle continues to zoom in special eff...
Table of contents 1. TypeScript is introduced int...