How to filter out duplicate data when inserting large amounts of data into MySQL

How to filter out duplicate data when inserting large amounts of data into MySQL

Preface:

The reason for working overtime is to go online and solve the problem of duplicate data in the online database. bug in the program was discovered, which was easily solved. The problem was a bit severe in correcting the duplicate data online.

There are duplicate data in 6 tables in the online library, 2 of which are relatively large, one with more than 960,000 and the other with more than 300,000. Because I have dealt with the same problem before, I directly used the Python deduplication script from last time. The script is very simple, which is to connect to the database, find out the duplicate data, and delete it in a loop.

Emmmm, but the efficiency is too low, one piece per second, about 20,000+ duplicate data, and the estimated time is about 8 hours. . .

There is a problem in blindly relying on the things of predecessors without thinking for yourself! Always wondering why it worked before and why it doesn’t work now is also a problem! I found that I have been in a bad state recently and have lost the desire to explore and seek knowledge. Today is a wake-up call for me and I feel like I have finally found my way back to the right path.

Now, let's get back to the topic. The deduplication steps are introduced in detail below.

1. Discover the problem

CREATE TABLE `animal` (  
  `id` int(11) NOT NULL AUTO_INCREMENT,  
  `name` varchar(20) DEFAULT NULL,  
  `age` int(11) DEFAULT NULL,  
  PRIMARY KEY (`id`)  
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('1', 'cat', '12');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('2', 'dog', '13');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('3', 'camel', '25');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('4', 'cat', '32');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('5', 'dog', '42'); 


Goal: We want to remove data with the same name .

First, see which data is repeated.

SELECT name,count( 1 )   
FROM  
 student   
GROUP BY  
NAME   
HAVING  
 count( 1 ) > 1; 


Output:

name count(1) cat 2 dog 2

The data name cat and dog are repeated, and each repeated data has two entries;

Select * From Table Where Repeating Field In (Select Repeating Field From Table Group By Repeating Field Having Count(1)>1) 

2. Delete all duplicate data without leaving any

Direct deletion will result in an error.

DELETE   
FROM  
 student   
WHERE  
 NAME IN (  
 SELECT NAME   
 FROM  
  student   
 GROUP BY  
 NAME   
HAVING  
 count( 1 ) > 1) 


Error:

1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s 


The reason is: while updating this table, this table is queried, while querying this table, this table is updated again, which can be understood as a deadlock. MySQL does not support this operation of updating and querying the same table.

Solution: Query the columns of data to be updated as a third-party table, and then filter and update them.

DELETE   
FROM  
 student   
WHERE  
 NAME IN (  
 SELECT  
  t.NAME   
FROM  
 ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1 ) t) 

3. Delete duplicate data from the delete table and keep only one

Before deleting, we can check what kind of duplicate data we want to delete.

SELECT  
 *   
FROM  
 student   
WHERE  
 id NOT IN (  
 SELECT  
  t.id   
 FROM  
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t   
 ) 


What does this mean? First, we group by name and find the data with the smallest id. These data are the fire seeds we want to keep. Then, we query the data with ids that are not in it, which are the duplicate data we want to delete.

4. Start deleting duplicate data and keep only one

It's very simple. Just replace select delete .

DELETE   
FROM  
 student   
WHERE  
 id NOT IN (  
 SELECT  
  t.id   
 FROM  
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t   
 ) 


The execution of a table with 900,000+ records is super fast.

This is the end of this article about how to filter out duplicate data during MySQL bulk inserts. For more information about how to filter out duplicate data in MySQL, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • MYSQL Must Know Reading Notes Chapter 6 Filtering Data
  • MYSQL uses regular expressions to filter data
  • Comparison of storage engines supported by MySQL database
  • MySQL helps you understand index pushdown in seconds
  • Python numpy implements multiple loops to read files and filter data at equal intervals
  • Explain in detail how to remove or filter certain values ​​or rows in a data set in pandas?
  • How to filter data using orderBy and filter in angularJs

<<:  Responsive Web Design Learning (3) - How to improve the performance of web pages on mobile devices

>>:  base target="" controls the link's target open frame

Recommend

Vue's Render function

Table of contents 1. Nodes, trees, and virtual DO...

How to use iostat to view Linux hard disk IO performance

TOP Observation: The percentage of CPU time occup...

Vue implements multi-grid input box on mobile terminal

Recently, the company has put forward a requireme...

CSS overflow-wrap new property value anywhere usage

1. First, understand the overflow-wrap attribute ...

Count the list tags in HTML

1. <dl> defines a list, <dt> defines ...

Deep understanding of the use of ::before/:before and ::after/:after

Part 1: Basics 1. Unlike pseudo-classes such as :...

JavaScript to implement the function of changing avatar

This article shares the specific code of JavaScri...

Detailed explanation of basic management of KVM virtualization in CentOS7

1. Install kvm virtualization : : : : : : : : : :...

MySQL index knowledge summary

The establishment of MySQL index is very importan...

Detailed explanation of MLSQL compile-time permission control example

Preface The simple understanding of MySQL permiss...

Implementation process of nginx high availability cluster

This article mainly introduces the implementation...

What to do if you forget the initial password of MySQL on MAC

The solution to forgetting the initial password o...