Comparative Analysis of IN and Exists in MySQL Statements

Comparative Analysis of IN and Exists in MySQL Statements

Background

Recently, when writing SQL statements, I was hesitant about whether to choose IN or Exists, so I wrote out the SQL of both methods to compare the execution efficiency. I found that the query efficiency of IN is much higher than that of Exists, so I naturally assumed that the efficiency of IN is better than Exists. However, in line with the principle of getting to the bottom of things, I want to know whether this conclusion applies to all scenarios and why this result occurs.
I checked the relevant information online and it can be roughly summarized as follows: when the external table is small and the internal table is large, Exists is applicable; when the external table is large and the internal table is small, IN is applicable. Then I am confused, because in my SQL statement, the outer table has only 10,000 data, and the inner table has 300,000 data. According to the Internet, the efficiency of Exists should be higher than IN, but my result is just the opposite! !
"No investigation, no right to speak"! So I started to study the actual execution process of IN and Exists, and from a practical perspective, I tried to find the root cause, which led to this blog post.

Experimental data

My experimental data includes two tables: t_author table and t_poetry table.
The amount of data in the corresponding table:

t_author table, 13355 records;
t_poetry table, 289,917 records.

The corresponding table structure is as follows:

CREATE TABLE t_poetry (
id bigint(20) NOT NULL AUTO_INCREMENT,
poetry_id bigint(20) NOT NULL COMMENT 'Poetry id',
poetry_name varchar(200) NOT NULL COMMENT 'Poetry name',
<font color=red> author_id bigint(20) NOT NULL COMMENT 'Author id'</font>
PRIMARY KEY ( id ),
UNIQUE KEY pid_idx ( poetry_id ) USING BTREE,
KEY aid_idx ( author_id ) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=291270 DEFAULT CHARSET=utf8mb4

CREATE TABLE t_author (
id int(15) NOT NULL AUTO_INCREMENT,
author_id bigint(20) NOT NULL,</font>
author_name varchar(32) NOT NULL,
dynasty varchar(16) NOT NULL,
poetry_num int(8) NOT NULL DEFAULT '0'
PRIMARY KEY ( id ),
UNIQUE KEY authorid_idx ( author_id ) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=13339 DEFAULT CHARSET=utf8mb4

Execution plan analysis IN execution process

SQL example: select * from tabA where tabA.x in (select x from tabB where y>0 );

Its execution plan:
(1) Execute the subquery of the tabB table to obtain the result set B, and use the index y of the tabB table;
(2) Execute a query on the tabA table. The query condition is that tabA.x is in result set B. The index x of the tabA table can be used.

Exists execution process

SQL example: select from tabA where exists (select from tabB where y>0);

Its execution plan:

(1) First retrieve all records from the tabA table.
(2) For each record of table tabA, associate table tabB row by row to determine whether the subquery of table tabB returns any data. Versions after 5.5 use Block Nested Loop.
(3) If the subquery returns data, the current record of tabA is returned to the result set.
tabA is equivalent to traversing the entire table data, and tabB can use the index.

Experimental procedures

The experiment analyzes the IN and Exists SQL statements for the same result set.
SQL statement containing IN:

select from t_author ta where author_id in
(select author_id from t_poetry tp where tp.poetry_id>3650 );

SQL statement containing Exists:

select from t_author ta where exists
(select * from t_poetry tp where tp.poetry_id>3650 and tp.author_id=ta.author_id);

Data from the first experiment

t_author table, 13355 records; t_poetry table, subquery filter result set where poetry_id>293650, 121 records;

Execution Results

Using exists takes 0.94 seconds, and using in takes 0.03 seconds. IN is more efficient than Exists.

Cause Analysis

The result set of the subquery on the t_poetry table is very small, and both can use indexes on the t_poetry table, and the consumption of the t_poetry subquery is basically the same. The difference between the two is that when using in, the t_author table can use the index:


When using exists, the t_author table is fully scanned:


When the subquery result set is small, the query time is mainly reflected in the traversal of the t_author table.

Second experimental data

t_author table, 13,355 records; t_poetry table, subquery filter result set where poetry_id>3650, 287,838 records;

Execution time

Using exists takes 0.12 seconds, and using in takes 0.48 seconds. Exists is more efficient than IN .

Cause Analysis

The index usage of the two experiments is consistent with that of the first experiment. The only difference is the size of the subquery filtering result set. However, the experimental results are different from the first one. In this case, the subquery result set is very large. Let's look at the MySQL query plan:
When using in, since the subquery result set is very large, both t_author and t_poetry tables are close to full table scans. At this time, the difference in the time consumption of traversing the t_author table can be ignored on the overall efficiency. There is one more line <auto_key> in the execution plan. In the case of a close full table scan, the MySQL optimizer chooses auto_key to traverse the t_author table:

When using exists, the change in data volume does not change the execution plan. However, due to the large subquery result set, MySQL versions after 5.5 use Block Nested-Loop (Block nested loop, introducing join buffer, similar to cache function) when matching the query results of exists, which begins to have a significant impact on query efficiency, especially when the subquery result set is large, it can significantly improve the query matching efficiency:

Experimental Conclusion

Based on the above two experiments and their results, we can clearly understand the execution process of IN and Exists, and summarize the applicable scenarios of IN and Exists:

IN queries can use indexes on both inner and outer tables; Exists queries can use indexes only on inner tables. When the subquery result set is large and the outer table is small, the Block Nested Loop of Exists begins to play a role and makes up for the defect that the outer table cannot use indexes, and the query efficiency will be better than IN. When the subquery result set is small and the external table is large, the optimization effect of the Block nested loop of Exists is not obvious, and the advantage of the external table index of IN plays a major role. In this case, the query efficiency of IN is better than that of Exists. What is said online is inaccurate. In fact, the "size of the table" does not refer to the internal table and the external table, but the external table and the subquery result set . The last point is also the most important one: there is no absolute truth in the world. Grasping the essence of things and conducting practical verification for different scenarios is the most reliable and effective method. Supplementary information on problems found during the experiment

When analyzing the above exists statements under different data sets, I found that the larger the data set, the shorter the time consumed, which is very strange.
The specific query conditions are:

where tp.poetry_id>3650, takes 0.13S
where tp.poetry_id>293650, takes 0.46S

Possible reasons: The larger the condition value is, the later the query is, the more records need to be traversed, resulting in more time consumption. This explanation needs to be further verified.

You may also be interested in:
  • Comparison of the usage of EXISTS and IN in MySQL
  • Basic usage of exists, in and any in MySQL
  • Introduction to the use and difference between in and exists in MySQL
  • MySQL exists and in detailed explanation and difference
  • Summary of the differences between in query and exists query in mySQL
  • MYSQL IN and EXISTS optimization examples
  • mysql not in, left join, IS NULL, NOT EXISTS efficiency problem record
  • Detailed explanation of the difference between in and exists in MySQL

<<:  How to view and terminate running background programs in Linux

>>:  Detailed explanation of Vue custom instructions and their use

Recommend

How to run MySQL using docker-compose

Directory Structure . │ .env │ docker-compose.yml...

How to extend Vue Router links in Vue 3

Preface The <router-link> tag is a great to...

CentOS7.x uninstall and install MySQL5.7 operation process and encoding format modification method

1. Uninstalling MySQL 5.7 1.1查看yum是否安裝過mysql cd y...

Solution to EF (Entity Framework) inserting or updating data errors

Error message: Store update, insert, or delete st...

Solution to the problem of web page flash animation not displaying

<br />The solution steps are as follows: Sta...

Several reasons for not compressing HTML

The reason is simple: In HTML documents, multiple ...

A brief talk about JavaScript parasitic composition inheritance

Composition inheritance Combination inheritance i...

Vue detailed explanation of mixins usage

Table of contents Preface 1. What are Mixins? 2. ...

Automatic failover of slave nodes in replication architecture in MySQL 8.0.23

I have been in contact with MGR for some time. Wi...

Detailed process of modifying hostname after Docker creates a container

There is a medicine for regret in the world, as l...

Customization Method of Linux Peripheral File System

Preface Generally speaking, when we talk about Li...

Build a Scala environment under Linux and write a simple Scala program

It is very simple to install Scala environment in...

CSS3 realizes the effect of triangle continuous enlargement

1. CSS3 triangle continues to zoom in special eff...

Use Typescript configuration steps in Vue

Table of contents 1. TypeScript is introduced int...