How to optimize a MySQL deduplication operation to the extreme

• Questions raised

The structure of the source table t_source is as follows:

item_id int,
 created_time datetime,
 modified_time datetime,
 item_name varchar(20),
 other varchar(20)

Requirements:

1. The source table holds 1 million records, of which 500,000 have duplicate (created_time, item_name) values.
2. Write the 500,000 deduplicated rows into the target table.
3. When several rows share the same created_time and item_name, any one of them may be kept; no other rule applies.
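Stated abstractly, the task is "keep any one row per (created_time, item_name) pair". A minimal Python sketch of that requirement, using made-up rows purely for illustration:

```python
# Keep any one row per (created_time, item_name); rows are made-up examples.
rows = [
    (1, "2017-01-01 00:00:01", "a1"),
    (2, "2017-01-01 00:00:02", "a2"),
    (500001, "2017-01-01 00:00:01", "a1"),  # same key as the first row
]

seen = set()
deduped = []
for item_id, created_time, item_name in rows:
    key = (created_time, item_name)
    if key not in seen:  # the first row with a given key wins; any one is acceptable
        seen.add(key)
        deduped.append((item_id, created_time, item_name))

print(len(deduped))  # 2
```

The rest of the article is about making MySQL do the equivalent of this single pass as fast as possible over 1 million rows.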

• Experimental environment

Linux virtual machine: CentOS release 6.4; 8 GB physical memory (4 GB allocated to MySQL); 100 GB mechanical hard disk; two dual-core physical CPUs, four processors in total; MySQL 8.0.16.

• Create test tables and data

-- Create the source table
create table t_source 
(item_id int, 
 created_time datetime, 
 modified_time datetime, 
 item_name varchar(20), 
 other varchar(20) 
); 
-- Create the target table
create table t_target like t_source; 
-- Generate 1 million rows of test data, with 500,000 duplicated (created_time, item_name) combinations
delimiter //
create procedure sp_generate_data() 
begin 
 set @i := 1; 
 while @i<=500000 do 
 set @created_time := date_add('2017-01-01',interval @i second); 
 set @modified_time := @created_time; 
 set @item_name := concat('a',@i); 
 insert into t_source 
 values (@i,@created_time,@modified_time,@item_name,'other'); 
 set @i:=@i+1; 
 end while; 
 commit; 
 set @last_insert_id := 500000; 
 insert into t_source 
 select item_id + @last_insert_id, 
 created_time, 
 date_add(modified_time,interval @last_insert_id second), 
 item_name, 
 'other' 
 from t_source; 
 commit;
end 
// 
delimiter ; 
call sp_generate_data(); 

-- The source table has no primary key or unique constraint, and there may be two identical data, so insert another record to simulate this situation.
insert into t_source select * from t_source where item_id=1;

There are now 1,000,001 records in the source table, and after deduplication the target table should contain 500,000 records.
mysql> select count(*),count(distinct created_time,item_name) from t_source;
+----------+----------------------------------------+
| count(*) | count(distinct created_time,item_name) |
+----------+----------------------------------------+
| 1000001 | 500000 |
+----------+----------------------------------------+
1 row in set (1.92 sec)

1. Clever use of indexes and variables

1. No-index comparison test

(1) Using correlated subqueries

truncate t_target; 
insert into t_target 
select distinct t1.* from t_source t1 where item_id in 
(select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name);

This statement would take far too long to produce a result, so we only examine its execution plan.

mysql> explain select distinct t1.* from t_source t1 where item_id in 
 -> (select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name); 
+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 1.00 | Using where |
+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+
2 rows in set, 3 warnings (0.00 sec)

Both the main query and the correlated subquery scan the whole table, so on the order of 1 million × 1 million row combinations must be examined. No wonder the statement never finishes.

(2) Using table joins

truncate t_target; 
insert into t_target 
select distinct t1.* from t_source t1, 
(select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2 
where t1.item_id = t2.item_id;

This method takes 14 seconds and the query plan is as follows:

mysql> explain select distinct t1.* from t_source t1, (select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2 where t1.item_id = t2.item_id;
+----+-------------+------------+------------+------+---------------+-------------+---------+----------------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+-------------+---------+----------------+--------+----------+------------------------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using where; Using temporary |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 5 | test.t1.item_id | 10 | 100.00 | Distinct |
| 2 | DERIVED | t_source | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using temporary |
+----+-------------+------------+------------+------+---------------+-------------+---------+----------------+--------+----------+------------------------------+
3 rows in set, 1 warning (0.00 sec)

•The inner query scans the 1 million rows of t_source, uses a temporary table to find the minimum item_id per (created_time, item_name) group, and materializes the result as the 500,000-row derived table derived2.
•MySQL automatically creates the index auto_key0 on the item_id column of derived2.
•The outer query also scans the 1 million rows of t_source. When joining to the derived table, for each t_source row's item_id the auto_key0 index locates the matching row in derived2, and the distinct operation is optimized at this point: once the first matching row is found, the search for further rows with the same value stops.
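The two-step logic of this join method can be mirrored in a few lines of Python (purely illustrative rows): first compute min(item_id) per key, then keep only the rows whose item_id appears in that set.

```python
# Hypothetical rows: (item_id, created_time, item_name)
rows = [(1, "t1", "a1"), (3, "t1", "a1"), (2, "t2", "a2")]

# Inner query: min(item_id) per (created_time, item_name) group.
min_ids = {}
for item_id, created_time, item_name in rows:
    key = (created_time, item_name)
    if key not in min_ids or item_id < min_ids[key]:
        min_ids[key] = item_id

# Outer query: join t_source back against the derived "min" set on item_id.
keep = set(min_ids.values())
result = [r for r in rows if r[0] in keep]
print(result)  # [(1, 't1', 'a1'), (2, 't2', 'a2')]
```

Note that this, like the SQL, makes two passes over the data: one to build the per-group minimums, one to filter.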

(3) Using variables

set @a:='1000-01-01 00:00:00'; 
set @b:=' '; 
set @f:=0; 
truncate t_target; 
insert into t_target 
select item_id,created_time,modified_time,item_name,other 
 from 
(select t0.*,if(@a=created_time and @b=item_name,@f:=0,@f:=1) f, @a:=created_time,@b:=item_name 
 from 
(select * from t_source order by created_time,item_name) t0) t1 where f=1;

This method takes 13 seconds and the query plan is as follows:

mysql> explain select item_id,created_time,modified_time,item_name,other 
 -> from 
 -> (select t0.*,if(@a=created_time and @b=item_name,@f:=0,@f:=1) f, @a:=created_time,@b:=item_name 
 -> from 
 -> (select * from t_source order by created_time,item_name) t0) t1 where f=1; 
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+--------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+--------+----------+----------------+
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | const | 10 | 100.00 | NULL |
| 2 | DERIVED | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | NULL |
| 3 | DERIVED | t_source | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using filesort |
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+--------+----------+----------------+
3 rows in set, 5 warnings (0.00 sec)

•The innermost query scans the 1 million rows of t_source and uses filesort to generate the derived table derived3.
•The second level scans the 1 million rows of derived3, producing the derived table derived2 while performing the variable comparisons and assignments, and automatically creates the index auto_key0 on the derived column f.
•The outermost layer reads derived2 through the auto_key0 index to obtain the deduplicated result rows.

Compared with method (2) above, the total number of scanned rows is unchanged at 2 million. The only slight difference is that here the automatically created index is on the derived column f (probed with a constant), whereas in the join method it is on the item_id column, so the query times are almost identical.

So far we have not created any index on the source table. Whichever way the statement is written, the created_time and item_name columns must be sorted to detect duplicates. It is therefore natural to think that a composite index on these two columns could exploit the index's inherent ordering to eliminate the extra sort and improve query performance.

2. Create a joint index comparison test on created_time and item_name

-- Create a composite index on the created_time and item_name columns
create index idx_sort on t_source(created_time,item_name,item_id); 
analyze table t_source;

(1) Using correlated subqueries

truncate t_target; 
insert into t_target 
select distinct t1.* from t_source t1 where item_id in 
(select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name);

This time it took 19 seconds, and the query plan is as follows:

mysql> explain select distinct t1.* from t_source t1 where item_id in 
 -> (select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name); 
+----+--------------------+-------+------------+------+---------------+----------+--------+---------+----------------------------------------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+----------+--------+---------+----------------------------------------+--------+----------+------------------------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997281 | 100.00 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | t2 | NULL | ref | idx_sort | idx_sort | 89 | test.t1.created_time,test.t1.item_name | 2 | 100.00 | Using index |
+----+--------------------+-------+------------+------+---------------+----------+--------+---------+----------------------------------------+--------+----------+------------------------------+
2 rows in set, 3 warnings (0.00 sec)

•The t_source table in the outer query is the driving table and must be scanned in full, 1 million rows.
•For each row of the driving table, the matching rows (about two) are located in t2 through the idx_sort index by created_time and item_name.

(2) Using table joins

truncate t_target; 
insert into t_target 
select distinct t1.* from t_source t1, 
(select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2 
where t1.item_id = t2.item_id;

This time it took 13 seconds, and the query plan is as follows:

mysql> explain select distinct t1.* from t_source t1, 
 -> (select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2 
 -> where t1.item_id = t2.item_id; 
+----+-------------+------------+------------+-------+---------------+------------+---------+-----------------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------+------------+---------+-----------------+--------+----------+------------------------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997281 | 100.00 | Using where; Using temporary |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 5 | test.t1.item_id | 10 | 100.00 | Distinct |
| 2 | DERIVED | t_source | NULL | index | idx_sort | idx_sort | 94 | NULL | 997281 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+------------+---------+-----------------+--------+----------+------------------------------+
3 rows in set, 1 warning (0.00 sec)

Compared with no index, although the subquery changes from a full table scan to a full index scan, it still needs to scan 1 million rows of records. Therefore, the query performance improvement is not obvious.

(3) Using variables

set @a:='1000-01-01 00:00:00'; 
set @b:=' '; 
set @f:=0; 
truncate t_target; 
insert into t_target 
select item_id,created_time,modified_time,item_name,other 
 from 
(select t0.*,if(@a=created_time and @b=item_name,@f:=0,@f:=1) f, @a:=created_time,@b:=item_name 
 from 
(select * from t_source order by created_time,item_name) t0) t1 where f=1;

This time it took 13 seconds, and the query plan is exactly the same as without the index, so the index has no effect on this form of the query. Can the nesting be eliminated so the result is obtained in a single layer?

(4) Use variables and eliminate nested queries

set @a:='1000-01-01 00:00:00'; 
set @b:=' '; 
truncate t_target; 
insert into t_target 
select * from t_source force index (idx_sort) 
 where (@a!=created_time or @b!=item_name) and (@a:=created_time) is not null and (@b:=item_name) is not null 
 order by created_time,item_name;

This time it took 12 seconds, and the query plan is as follows:

mysql> explain select * from t_source force index (idx_sort) 
 -> where (@a!=created_time or @b!=item_name) and (@a:=created_time) is not null and (@b:=item_name) is not null 
 -> order by created_time,item_name;
+----+-------------+----------+------------+-------+---------------+----------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+----------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | t_source | NULL | index | NULL | idx_sort | 94 | NULL | 997281 | 99.00 | Using where |
+----+-------------+----------+------------+-------+---------------+----------+---------+------+--------+----------+-------------+
1 row in set, 3 warnings (0.00 sec)

This statement has the following characteristics:

•The nested subquery is eliminated, and only one full index scan of the t_source table is required, so the query plan is optimal.
•No need to use DISTINCT to check for duplicate content again.
•Variable judgment and assignment only appear in the where clause.
• Eliminate filesort by using indexes.

Prior to MySQL 8, this statement was the best solution for single-threaded deduplication. A careful analysis of this statement reveals that it cleverly utilizes the logical query processing steps and index features of SQL statements. The logical steps of a SQL query are:

1. Perform a Cartesian product (cross join)
2. Apply ON filter (join condition)
3. Add outer rows (outer join)
4. Apply the where filter
5. Grouping
6. Apply cube or rollup
7. Apply the having filter
8. Processing select lists
9. Apply the distinct clause
10. Apply the order by clause
11. Apply the limit clause

The logical execution steps of each query statement are a subset of these 11 steps. Take this query statement as an example. Its execution order is: force the search for data rows through the index idx_sort -> apply the where filter -> process the select list -> apply the order by clause.

For the variables to be assigned and compared in (created_time, item_name) order, the rows must be read in index order. The force index (idx_sort) hint enforces this and is required for the statement to be correct: without it, the table would be scanned first and sorted afterwards, so the order of variable assignment could not be guaranteed. The order by clause cannot be dropped either; otherwise, even with the force index hint, MySQL would use a full table scan rather than a full index scan, producing incorrect results. The index also supplies the ordering of created_time and item_name, avoiding a filesort. The force index (idx_sort) hint and the order by clause are thus both indispensable, and the index idx_sort serves both purposes at once, killing two birds with one stone.

Before the statement starts, the variables are initialized to values that cannot occur in the data. The where clause is then evaluated left to right for each row, in (created_time, item_name) order: first the variables are compared with the column values, then the row's created_time and item_name are assigned to the variables. Because item_name is a string, (@b:=item_name) by itself is not a valid Boolean expression, so it is written as (@b:=item_name) is not null.
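The variable trick amounts to a single pass over index-ordered rows, comparing each row with the previous one. A Python sketch of the same control flow, with made-up rows pre-sorted by the key:

```python
# Hypothetical rows, already sorted by (created_time, item_name),
# which is what the full index scan guarantees.
rows = [(1, "t1", "a1"), (9, "t1", "a1"), (2, "t2", "a2")]

# Initialize to values that cannot occur in the data, like @a and @b.
prev_time, prev_name = "1000-01-01 00:00:00", " "
kept = []
for item_id, created_time, item_name in rows:
    # Equivalent of: @a != created_time or @b != item_name
    if created_time != prev_time or item_name != prev_name:
        kept.append(item_id)
    # Equivalent of the assignments (@a := created_time), (@b := item_name)
    prev_time, prev_name = created_time, item_name

print(kept)  # [1, 2]
```

One scan, no extra sort, no second deduplication structure: exactly why the single-layer variable statement is the fastest single-threaded form.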

Finally, note that the alternative "insert into t_target select * from t_source group by created_time,item_name;" is not considered here because it is rejected under sql_mode='ONLY_FULL_GROUP_BY'.

2. Using window functions

The new window functions in MySQL 8 make the previously troublesome deduplication operation very simple.

truncate t_target; 
insert into t_target 
select item_id, created_time, modified_time, item_name, other
 from (select *, row_number() over(partition by created_time,item_name) as rn
 from t_source) t where rn=1;

This statement only takes 12 seconds to execute, and the writing is clear and easy to understand. The query plan is as follows:

mysql> explain select item_id, created_time, modified_time, item_name, other
 -> from (select *, row_number() over(partition by created_time,item_name) as rn
 -> from t_source) t where rn=1;
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+--------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+--------+----------+----------------+
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 8 | const | 10 | 100.00 | NULL |
| 2 | DERIVED | t_source | NULL | ALL | NULL | NULL | NULL | NULL | 997281 | 100.00 | Using filesort |
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+--------+----------+----------------+
2 rows in set, 2 warnings (0.00 sec)

This query performs a full table scan of t_source and uses filesort to sort the table by the partition columns created_time and item_name. The outer query then keeps one row from each partition. Since any of the rows sharing created_time and item_name may be kept, no order by clause is needed inside over().

Judging only from the execution plan, the window-function statement looks no better than the nested variable-based one, yet in practice this method is the fastest.
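The effect of row_number() over (partition by ...) with rn = 1 can be emulated in Python (toy rows; within a partition the kept row is arbitrary, which the requirement allows):

```python
from itertools import groupby

# Hypothetical rows: (item_id, created_time, item_name)
rows = [(1, "t1", "a1"), (9, "t1", "a1"), (2, "t2", "a2")]

key_fn = lambda r: (r[1], r[2])  # partition by (created_time, item_name)
rows.sort(key=key_fn)            # corresponds to the filesort step in the plan

# Keeping rn = 1 means keeping the first row of each partition.
deduped = [next(group) for _, group in groupby(rows, key=key_fn)]
print(deduped)  # [(1, 't1', 'a1'), (2, 't2', 'a2')]
```

Sort once, take the first row of each group: the same shape of work as the variable method, but expressed declaratively.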

For a description of MySQL window functions, see https://dev.mysql.com/doc/refman/8.0/en/window-functions.html.

3. Multithreaded parallel execution

The single duplicate checking statement has been adjusted to the optimal state, but it is still executed in a single-threaded manner. Can we use multiple processors to execute deduplication operations in parallel using multiple threads, thereby further increasing the speed? For example, my experimental environment is 4 processors. If I use 4 threads to execute the duplicate checking SQL at the same time, theoretically the performance should be improved by nearly 4 times.

1. Data Sharding

When the test data was generated, created_time was incremented by one second per record, so the span between the minimum and maximum is 500,000 seconds and the data is evenly distributed. We therefore split the data into 4 equal parts.

(1) Query the created_time boundary values ​​of the four sets of data

mysql> select date_add('2017-01-01',interval 125000 second) dt1,
 -> date_add('2017-01-01',interval 2*125000 second) dt2,
 -> date_add('2017-01-01',interval 3*125000 second) dt3,
 -> max(created_time) dt4
 -> from t_source;
+---------------------+---------------------+---------------------+---------------------+
| dt1 | dt2 | dt3 | dt4 |
+---------------------+---------------------+---------------------+---------------------+
| 2017-01-02 10:43:20 | 2017-01-03 21:26:40 | 2017-01-05 08:10:00 | 2017-01-06 18:53:20 |
+---------------------+---------------------+---------------------+---------------------+
1 row in set (0.00 sec)

(2) Check the number of records in each data set to confirm that the data is evenly distributed

mysql> select case when created_time >= '2017-01-01' 
 -> and created_time < '2017-01-02 10:43:20'
 -> then '2017-01-01'
 -> when created_time >= '2017-01-02 10:43:20'
 -> and created_time < '2017-01-03 21:26:40'
 -> then '2017-01-02 10:43:20'
 -> when created_time >= '2017-01-03 21:26:40' 
 -> and created_time < '2017-01-05 08:10:00'
 -> then '2017-01-03 21:26:40' 
 -> else '2017-01-05 08:10:00'
 -> end min_dt,
 -> case when created_time >= '2017-01-01' 
 -> and created_time < '2017-01-02 10:43:20'
 -> then '2017-01-02 10:43:20'
 -> when created_time >= '2017-01-02 10:43:20'
 -> and created_time < '2017-01-03 21:26:40'
 -> then '2017-01-03 21:26:40'
 -> when created_time >= '2017-01-03 21:26:40' 
 -> and created_time < '2017-01-05 08:10:00'
 -> then '2017-01-05 08:10:00'
 -> else '2017-01-06 18:53:20'
 -> end max_dt,
 -> count(*)
 -> from t_source
 -> group by case when created_time >= '2017-01-01' 
 -> and created_time < '2017-01-02 10:43:20'
 -> then '2017-01-01'
 -> when created_time >= '2017-01-02 10:43:20'
 -> and created_time < '2017-01-03 21:26:40'
 -> then '2017-01-02 10:43:20'
 -> when created_time >= '2017-01-03 21:26:40' 
 -> and created_time < '2017-01-05 08:10:00'
 -> then '2017-01-03 21:26:40' 
 -> else '2017-01-05 08:10:00'
 -> end,
 -> case when created_time >= '2017-01-01' 
 -> and created_time < '2017-01-02 10:43:20'
 -> then '2017-01-02 10:43:20'
 -> when created_time >= '2017-01-02 10:43:20'
 -> and created_time < '2017-01-03 21:26:40'
 -> then '2017-01-03 21:26:40'
 -> when created_time >= '2017-01-03 21:26:40' 
 -> and created_time < '2017-01-05 08:10:00'
 -> then '2017-01-05 08:10:00'
 -> else '2017-01-06 18:53:20'
 -> end;
+---------------------+---------------------+----------+
| min_dt | max_dt | count(*) |
+---------------------+---------------------+----------+
| 2017-01-01 | 2017-01-02 10:43:20 | 249999 |
| 2017-01-02 10:43:20 | 2017-01-03 21:26:40 | 250000 |
| 2017-01-03 21:26:40 | 2017-01-05 08:10:00 | 250000 |
| 2017-01-05 08:10:00 | 2017-01-06 18:53:20 | 250002 |
+---------------------+---------------------+----------+
4 rows in set (4.86 sec)

The union of the four ranges must cover the entire source data set, with no overlap between them; that is, the created_time ranges must be contiguous and mutually exclusive. Contiguity ensures that every row is processed, and mutual exclusion ensures that no second deduplication pass is needed. This is essentially the idea behind range partitioning by time; a partitioned table might even be a better fit, but the step of rebuilding the table is skipped here.
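Contiguity and mutual exclusion come for free from half-open intervals [lo, hi). A small Python sketch reproducing the boundary values queried above (in the stored procedure below, the last shard closes its right end so max(created_time) itself is included):

```python
from datetime import datetime, timedelta

base = datetime(2017, 1, 1)
# Shard boundaries: base + i * 125000 seconds, i = 0..4,
# matching dt1..dt3 and max(created_time) from the query above.
bounds = [base + timedelta(seconds=i * 125000) for i in range(5)]

# Shard i covers the half-open interval [bounds[i], bounds[i+1]).
for lo, hi in zip(bounds, bounds[1:]):
    print(lo, "->", hi)
```

Adjacent intervals share exactly one endpoint, so every created_time value falls into exactly one shard.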

2. Create the deduplication stored procedure

With the above information we can write 4 statements to process all the data. In order to make the calling interface as simple as possible, create the following stored procedure.

delimiter //
create procedure sp_unique(i smallint) 
begin 
 set @a:='1000-01-01 00:00:00'; 
 set @b:=' '; 
 if (i<4) then
 insert into t_target 
 select * from t_source force index (idx_sort) 
  where created_time >= date_add('2017-01-01',interval (i-1)*125000 second) 
  and created_time < date_add('2017-01-01',interval i*125000 second) 
  and (@a!=created_time or @b!=item_name) 
  and (@a:=created_time) is not null 
  and (@b:=item_name) is not null 
  order by created_time,item_name; 
 else 
 insert into t_target 
 select * from t_source force index (idx_sort) 
  where created_time >= date_add('2017-01-01',interval (i-1)*125000 second) 
  and created_time <= date_add('2017-01-01',interval i*125000 second) 
  and (@a!=created_time or @b!=item_name) 
  and (@a:=created_time) is not null 
  and (@b:=item_name) is not null 
  order by created_time,item_name; 
 end if; 
end 
//

The execution plan of the query statement is as follows:

mysql> explain select * from t_source force index (idx_sort) 
 -> where created_time >= date_add('2017-01-01',interval (1-1)*125000 second) 
 -> and created_time < date_add('2017-01-01',interval 1*125000 second) 
 -> and (@a!=created_time or @b!=item_name) 
 -> and (@a:=created_time) is not null 
 -> and (@b:=item_name) is not null 
 -> order by created_time,item_name; 
+----+-------------+----------+------------+-------+---------------+----------+---------+------+--------+----------+----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+----------+---------+------+--------+----------+----------------------+
| 1 | SIMPLE | t_source | NULL | range | idx_sort | idx_sort | 6 | NULL | 498640 | 100.00 | Using index condition |
+----+-------------+----------+------------+-------+---------------+----------+---------+------+--------+----------+----------------------+
1 row in set, 3 warnings (0.00 sec)

The MySQL optimizer performs index range scans and uses index condition pushdown (ICP) to optimize queries.

3. Parallel Execution

The following uses shell background process and MySQL Schedule Event to achieve parallelism.

(1) Shell background process

•Create duplicate_removal.sh file with the following content:

#!/bin/bash
mysql -vvv -u root -p123456 test -e "truncate t_target" &>/dev/null 
date '+%H:%M:%S'
for y in {1..4}
do
 sql="call sp_unique($y)"
 mysql -vvv -u root -p123456 test -e "$sql" &>par_sql1_$y.log &
done
wait
date '+%H:%M:%S'

• Execute script files

./duplicate_removal.sh

The execution output is as follows:

[mysql@hdp2~]$./duplicate_removal.sh
14:27:30
14:27:35

This method takes 5 seconds, and the 4 process calls executed in parallel take 4.87 seconds, 4.88 seconds, 4.91 seconds, and 4.73 seconds respectively:

[mysql@hdp2~]$cat par_sql1_1.log | sed '/^$/d'
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
call sp_unique(1)
--------------
Query OK, 124999 rows affected (4.87 sec)
Bye
[mysql@hdp2~]$cat par_sql1_2.log | sed '/^$/d'
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
call sp_unique(2)
--------------
Query OK, 125000 rows affected (4.88 sec)
Bye
[mysql@hdp2~]$cat par_sql1_3.log | sed '/^$/d'
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
call sp_unique(3)
--------------
Query OK, 125000 rows affected (4.91 sec)
Bye
[mysql@hdp2~]$cat par_sql1_4.log | sed '/^$/d'
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
call sp_unique(4)
--------------
Query OK, 125001 rows affected (4.73 sec)
Bye
[mysql@hdp2~]$

Each call took about 4.85 seconds on average. Because they ran in parallel, the total elapsed time is bounded by the slowest call, 4.91 seconds, which is roughly 2.5 times faster than the 12-second single-threaded run.
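The same fan-out works from any client language, not just a shell loop. A hypothetical Python sketch using a thread pool, where run_shard and its sleep are stand-ins for `call sp_unique(i)` on a real per-thread MySQL connection (no real driver call is made here):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_shard(i):
    # Stand-in for: cursor.execute("call sp_unique(%s)", (i,)) on its own connection
    time.sleep(0.1)  # simulates the per-shard query time
    return f"shard {i} done"

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_shard, [1, 2, 3, 4]))
elapsed = time.time() - start

print(results)
# Elapsed time tracks the slowest shard, not the sum of all four.
```

The key point carries over directly: each worker needs its own connection (hence its own session variables @a and @b), and total time is governed by the slowest shard.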

(2) MySQL Schedule Event

• Create event history log table

-- Used to view event execution time and other information
create table t_event_history ( 
 dbname varchar(128) not null default '', 
 eventname varchar(128) not null default '', 
 starttime datetime(3) not null default '1000-01-01 00:00:00', 
 endtime datetime(3) default null, 
 issuccess int(11) default null, 
 duration int(11) default null, 
 errormessage varchar(512) default null, 
 randno int(11) default null
);

• Create an event for each concurrent thread

delimiter //
create event ev1 on schedule at current_timestamp + interval 1 hour on completion preserve disable do 
begin
 declare r_code char(5) default '00000'; 
 declare r_msg text; 
 declare v_error integer; 
 declare v_starttime datetime default now(3); 
 declare v_randno integer default floor(rand()*100001); 
 insert into t_event_history (dbname, eventname, starttime, randno) 
 # Job name
 values(database(),'ev1', v_starttime,v_randno); 
 begin 
 # Exception handling section
 declare continue handler for sqlexception 
 begin 
  set v_error = 1; 
  get diagnostics condition 1 r_code = returned_sqlstate , r_msg = message_text; 
 end; 
 # The actual user procedure being called
 call sp_unique(1); 
 end; 
 update t_event_history set endtime=now(3),issuccess=isnull(v_error),duration=timestampdiff(microsecond,starttime,now(3)), errormessage=concat('error=',r_code,', message=',r_msg),randno=null where starttime=v_starttime and randno=v_randno; 
end
// 
create event ev2 on schedule at current_timestamp + interval 1 hour on completion preserve disable do 
begin
 declare r_code char(5) default '00000'; 
 declare r_msg text; 
 declare v_error integer; 
 declare v_starttime datetime default now(3); 
 declare v_randno integer default floor(rand()*100001); 
 insert into t_event_history (dbname, eventname, starttime, randno) 
 # Job name
 values(database(),'ev2', v_starttime,v_randno); 
 begin 
 # Exception handling section
 declare continue handler for sqlexception 
 begin 
  set v_error = 1; 
  get diagnostics condition 1 r_code = returned_sqlstate , r_msg = message_text; 
 end; 
 # The actual user procedure being called
 call sp_unique(2); 
 end; 
 update t_event_history set endtime=now(3),issuccess=isnull(v_error),duration=timestampdiff(microsecond,starttime,now(3)), errormessage=concat('error=',r_code,', message=',r_msg),randno=null where starttime=v_starttime and randno=v_randno; 
end
// 
create event ev3 on schedule at current_timestamp + interval 1 hour on completion preserve disable do 
begin
 declare r_code char(5) default '00000'; 
 declare r_msg text; 
 declare v_error integer; 
 declare v_starttime datetime default now(3); 
 declare v_randno integer default floor(rand()*100001); 
 insert into t_event_history (dbname, eventname, starttime, randno) 
 # Job name
 values(database(),'ev3', v_starttime,v_randno); 
 begin 
 # Exception handling section
 declare continue handler for sqlexception 
 begin 
  set v_error = 1; 
  get diagnostics condition 1 r_code = returned_sqlstate , r_msg = message_text; 
 end; 
 # The actual user procedure being called
 call sp_unique(3); 
 end; 
 update t_event_history set endtime=now(3),issuccess=isnull(v_error),duration=timestampdiff(microsecond,starttime,now(3)), errormessage=concat('error=',r_code,', message=',r_msg),randno=null where starttime=v_starttime and randno=v_randno; 
end
// 
create event ev4 on schedule at current_timestamp + interval 1 hour on completion preserve disable do 
begin
 declare r_code char(5) default '00000'; 
 declare r_msg text; 
 declare v_error integer; 
 declare v_starttime datetime default now(3); 
 declare v_randno integer default floor(rand()*100001); 
 insert into t_event_history (dbname, eventname, starttime, randno) 
 # Job name
 values(database(),'ev4', v_starttime,v_randno); 
 begin 
 # Exception handling section
 declare continue handler for sqlexception 
 begin 
  set v_error = 1; 
  get diagnostics condition 1 r_code = returned_sqlstate , r_msg = message_text; 
 end; 
 # The actual user procedure being called
 call sp_unique(4); 
 end; 
 update t_event_history set endtime=now(3),issuccess=isnull(v_error),duration=timestampdiff(microsecond,starttime,now(3)), errormessage=concat('error=',r_code,', message=',r_msg),randno=null where starttime=v_starttime and randno=v_randno; 
end
//

In order to record the execution time of each event, operation-log logic is added to each event definition. Since each event only executes one extra insert and one update — eight trivial statements in total across the four events — the impact on the test is negligible. Execution times are recorded to the millisecond.
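As a standalone sanity check, the duration column can be reproduced outside MySQL: timestampdiff(microsecond, starttime, endtime) is simply the difference between two millisecond-precision timestamps. A minimal Python sketch of that calculation (not part of the test itself):

```python
from datetime import datetime

def duration_microseconds(start: str, end: str) -> int:
    """Mimic MySQL's timestampdiff(microsecond, start, end) for
    datetime(3) values formatted as strings."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Build the result from exact integer components to avoid float error.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds

# Values taken from the event log shown later in this article:
print(duration_microseconds("2019-07-31 14:38:04.000", "2019-07-31 14:38:09.389"))
```

This matches the duration column of the log table (5389000 microseconds for ev1).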

• Trigger event execution

mysql -vvv -u root -p123456 test -e "truncate t_target;alter event ev1 on schedule at current_timestamp enable;alter event ev2 on schedule at current_timestamp enable;alter event ev3 on schedule at current_timestamp enable;alter event ev4 on schedule at current_timestamp enable;"

This command line enables the four events in sequence, but each alter event statement returns immediately rather than waiting for the event body to finish, so all four events run concurrently. This is clear from the command output:

[mysql@hdp2~]$mysql -vvv -u root -p123456 test -e "truncate t_target;alter event ev1 on schedule at current_timestamp enable;alter event ev2 on schedule at current_timestamp enable;alter event ev3 on schedule at current_timestamp enable;alter event ev4 on schedule at current_timestamp enable;"
mysql: [Warning] Using a password on the command line interface can be insecure.
--------------
truncate t_target
--------------
Query OK, 0 rows affected (0.06 sec)
--------------
alter event ev1 on schedule at current_timestamp enable
--------------
Query OK, 0 rows affected (0.02 sec)
--------------
alter event ev2 on schedule at current_timestamp enable
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
alter event ev3 on schedule at current_timestamp enable
--------------
Query OK, 0 rows affected (0.02 sec)
--------------
alter event ev4 on schedule at current_timestamp enable
--------------
Query OK, 0 rows affected (0.00 sec)
Bye
[mysql@hdp2~]$

• View event execution log

mysql> select * from test.t_event_history;
+--------+-----------+-------------------------+-------------------------+-----------+---------+--------------+--------+
| dbname | eventname | starttime | endtime | issuccess | duration | errormessage | randno |
+--------+-----------+-------------------------+-------------------------+-----------+---------+--------------+--------+
| test | ev1 | 2019-07-31 14:38:04.000 | 2019-07-31 14:38:09.389 | 1 | 5389000 | NULL | NULL |
| test | ev2 | 2019-07-31 14:38:04.000 | 2019-07-31 14:38:09.344 | 1 | 5344000 | NULL | NULL |
| test | ev3 | 2019-07-31 14:38:05.000 | 2019-07-31 14:38:09.230 | 1 | 4230000 | NULL | NULL |
| test | ev4 | 2019-07-31 14:38:05.000 | 2019-07-31 14:38:09.344 | 1 | 4344000 | NULL | NULL |
+--------+-----------+-------------------------+-------------------------+-----------+---------+--------------+--------+
4 rows in set (0.00 sec)

It can be seen that the four events take between 4.2 and 5.4 seconds each, about 4.83 seconds on average. Because they execute in parallel, the total elapsed time equals the slowest event, about 5.4 seconds. The optimization effect is almost the same as the shell background process method.
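The figures above come straight from the duration column: the average of the four values is roughly 4.83 seconds per event, while the parallel wall-clock time is bounded by the slowest event. A quick check of that arithmetic, with the values copied from the log:

```python
# Per-event durations read from t_event_history, in microseconds
durations_us = [5389000, 5344000, 4230000, 4344000]

avg_s = sum(durations_us) / len(durations_us) / 1_000_000   # average per event
wall_s = max(durations_us) / 1_000_000                      # parallel wall-clock bound

print(f"average {avg_s:.2f}s, wall clock {wall_s:.3f}s")
```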

• Summary

The above is my method for optimizing a MySQL deduplication operation to the extreme. I hope you find it helpful. If you have any questions, please leave me a message and I will reply promptly. I would also like to thank everyone for their support of the 123WORDPRESS.COM website!
If you found this article helpful, feel free to reprint it, but please cite the source. Thank you!

You may also be interested in:
  • Summary of three deduplication methods in SQL
  • Detailed example of using the distinct method in MySQL
  • A simple method to merge and remove duplicate MySQL tables
  • MySQL deduplication methods
  • Detailed explanation of two methods of deduplication in MySQL and example code
  • SQL Learning Notes 5: How to remove duplicates and assign values to newly added fields
  • Summary of SQL deduplication methods
