How to find and delete duplicate rows in MySQL

How to find and delete duplicate rows in MySQL

1. How to find duplicate rows

The first step is to define what counts as a duplicate row. In most cases this is simple: they have the same value in one column. This article uses this definition. Perhaps your definition of "duplicate" is more complicated than this, and you need to make some modifications to the SQL.

The data samples used in this article are:

create table test(id int not null primary key, day date not null);  

insert into test(id, day) values(1, '2006-10-08');  
insert into test(id, day) values(2, '2006-10-08');  
insert into test(id, day) values(3, '2006-10-09');  

select * from test;  
+----+------------+  
| id | day |  
+----+------------+  
| 1 | 2006-10-08 |  
| 2 | 2006-10-08 |  
| 3 | 2006-10-09 |  
+----+------------+

The first two rows have the same value in day field, so if I treat them as duplicates, here is a query to find them. The query uses the GROUP BY clause to group rows with the same field values ​​and then calculates the size of the group.

select day, count(*) from test GROUP BY day;  
+------------+----------+  
| day | count(*) |  
+------------+----------+  
| 2006-10-08 | 2 |  
| 2006-10-09 | 1 |  
+------------+----------+


The group size for duplicate rows is greater than 1. If you want to display only duplicate rows, you must use the HAVING clause, such as

select day, count(*) from test group by day HAVING count(*) > 1;  
+------------+----------+  
| day | count(*) |  
+------------+----------+  
| 2006-10-08 | 2 |  
+------------+----------+

This is the basic technique: group by a field with the same value, then display groups with a size greater than 1.

Why can't I use the WHERE clause? Because the WHERE clause filters the rows before grouping, and the HAVING clause filters the rows after grouping.

2. How to delete duplicate rows

A related question is how to remove duplicate rows. A common task is to keep only one row of duplicates and delete the others. You can then create appropriate indexes to prevent duplicate rows from being written to the database in the future.

Again, the first thing to do is to figure out what a duplicate row is. Which line do you want to keep? The first row, or the row with the maximum value for a certain field? In this article, it is assumed that you want to keep the first row - the row with the smallest value in the id field, which means you want to delete the other rows.

Perhaps the easiest way to do this is via a temporary table. Especially for MYSQL , there are some restrictions that you cannot update a table while select in a query statement. For simplicity, only the temporary table method is used here.

Our task is to remove all duplicate rows except the row with the minimum value of id field in the group. Therefore, we need to find the groups whose size is greater than 1 and the rows we want to keep. You can use the MIN() function. The statements here create a temporary table and find the rows that need to be deleted using DELETE .

create temporary table to_delete (day date not null, min_id int not null);  

insert into to_delete(day, min_id)  
  select day, MIN(id) from test group by day having count(*) > 1;  

select * from to_delete;  
+------------+--------+  
| day | min_id |  
+------------+--------+  
| 2006-10-08 | 1 |  
+------------+--------+



With this data in hand, you can start deleting the "dirty" rows. There are several approaches, each with its own advantages and disadvantages, but I won't compare them in detail here, except to explain the standard approach used in relational databases that support query clauses.

delete from test  
  where exists(  
     select * from to_delete  
     where to_delete.day = test.day and to_delete.min_id <> test.id  
  )

3. How to find duplicates on multiple columns

Someone recently asked this question: I have two fields b and c in one of my tables, which are respectively associated with the b and c fields of two other tables. I want to find out the rows that have duplicate values ​​on either the b field or the c field.

It was hard to understand at first, but after a conversation I understood: he wanted to create unique indexes for b and c respectively. As mentioned above, finding rows with duplicate values ​​for a field is as simple as grouping them with group and then calculating the size of the group. And it is also very simple to find rows with duplicate fields, just put all the fields into the group clause. However, if the problem is to determine whether field b is repeated or field c is repeated, the problem is much more difficult. The sample data used by the questioner here

create table a_b_c(  
  a int not null primary key auto_increment,  
  b int,  
  c int  
);  

insert into a_b_c(b,c) values ​​(1, 1);  
insert into a_b_c(b,c) values ​​(1, 2);  
insert into a_b_c(b,c) values ​​(1, 3);  
insert into a_b_c(b,c) values ​​(2, 1);  
insert into a_b_c(b,c) values ​​(2, 2);  
insert into a_b_c(b,c) values ​​(2, 3);  
insert into a_b_c(b,c) values ​​(3, 1);  
insert into a_b_c(b,c) values ​​(3, 2);  
insert into a_b_c(b,c) values ​​(3, 3);



Now, you can easily see that there are some duplicate rows in the table, but no two rows have the same tuple {b, c}. That's why the problem becomes difficult.

4. Incorrect query statements

If you group both columns together you will get different results depending on how you group and calculate the size. This is exactly where the questioner is stuck. Sometimes a query finds some duplicate rows but misses others. This is the query he used

select b, c, count(*) from a_b_c  
group by b, c  
having count(distinct b > 1)  
  or count(distinct c > 1);

The result returns all rows because CONT(*) is always 1. Why? Because >1 is written inside COUNT() . This error is easily overlooked and is in fact equivalent to

select b, c, count(*) from a_b_c  
group by b, c  
having count(1)  
  or count(1);

Why? Because (b > 1) is a boolean value, which is not what you want at all. What you want is

select b, c, count(*) from a_b_c  
group by b, c  
having count(distinct b) > 1  
  or count(distinct c) > 1;

Returns empty result. Obviously, since there are no duplicates of {b,c} . The person tried many other combinations of OR and AND , using one field for grouping and another field for calculating the size, like this

select b, count(*) from a_b_c group by b having count(distinct c) > 1;  
+------+----------+  
| b | count(*) |  
+------+----------+  
| 1 | 3 |  
| 2 | 3 |  
| 3 | 3 |  
+------+----------+

None of them can find all the duplicate rows. And the most frustrating thing is that for some situations, this statement is valid, if you mistakenly think that this is the way to write it, but for other situations, you may get the wrong result.

In fact, simply using GROUP BY is not feasible. Why? Because when you use group by on a certain field, the values ​​of another field will be dispersed into different groups. Sorting on these fields can see these effects, just as grouping does. First, sort the b field to see how it is grouped.

When you sort (group) the b field, the same value of c is divided into different groups, so you cannot用COUNT(DISTINCT c) to calculate the size. Internal functions such as COUNT () only work on the same group and cannot do anything for rows in different groups. Similarly, if the c field is sorted, b with the same value will be divided into different groups, which will not achieve our goal anyway.

5. Several correct methods

Perhaps the simplest way is to find duplicate rows for each field individually, then combine them using UNION, like this:

select b as value, count(*) as cnt, 'b' as what_col  
from a_b_c group by b having count(*) > 1  
union  
select c as value, count(*) as cnt, 'c' as what_col  
from a_b_c group by c having count(*) > 1;  
+-------+-----+----------+  
| value | cnt | what_col |  
+-------+-----+----------+  
| 1 | 3 | b |  
| 2 | 3 | b |  
| 3 | 3 | b |  
| 1 | 3 | c |  
| 2 | 3 | c |  
| 3 | 3 | c |  
+-------+-----+----------+

The what_col field is output to indicate which field is being repeated. Another approach is to use nested queries:

select a, b, c from a_b_c  
where b in (select b from a_b_c group by b having count(*) > 1)  
   or c in (select c from a_b_c group by c having count(*) > 1);  
+----+------+------+  
| a | b | c |  
+----+------+------+  
| 7 | 1 | 1 |  
| 8 | 1 | 2 |  
| 9 | 1 | 3 |  
| 10 | 2 | 1 |  
| 11 | 2 | 2 |  
| 12 | 2 | 3 |  
| 13 | 3 | 1 |  
| 14 | 3 | 2 |  
| 15 | 3 | 3 |  
+----+------+------+

This approach is much less efficient than using UNION and displays each repeated row rather than the repeated field values. Another method is to query the nested query results of yourself and group in a joint table. The writing method is relatively complicated, but it is necessary for complex data or situations with high requirements for efficiency.

 select a, a_b_c.b, a_b_c.c  
from a_b_c  
  left outer join (  
     select b from a_b_c group by b having count(*) > 1  
  ) as b on a_b_c.b = bb  
  left outer join (  
     select c from a_b_c group by c having count(*) > 1  
  ) as c on a_b_c.c = cc  
where bb is not null or cc is not null

The above methods work, and I'm sure there are others. If UNION can be used, I think it will be the simplest.

This is the end of this article about how to find and delete duplicate rows in MySQL. For more information about how to find and delete duplicate rows in MySQL, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • How to delete duplicate rows in mysql
  • Some ways to eliminate duplicate rows in MySQL
  • Comparison of storage engines supported by MySQL database
  • MySQL helps you understand index pushdown in seconds

<<:  A Deeper Look at the Differences Between Link and @import

>>:  Difference between var and let in JavaScript

Recommend

Using js to implement the two-way binding function of data in Vue2.0

Object.defineProperty Understanding grammar: Obje...

How to enable Swoole Loader extension on Linux system virtual host

Special note: Only the Swoole extension is instal...

About Zabbix custom monitoring items and triggers

Table of contents 1. Monitoring port Relationship...

Simple tips to increase web page loading speed

The loading speed of a web page is an important in...

MySQL 8.0.18 stable version released! Hash Join is here as expected

MySQL 8.0.18 stable version (GA) was officially r...

Native JavaScript implementation of progress bar

The specific code for JavaScript to implement the...

Pure CSS header fixed implementation code

There are two main reasons why it is difficult to...

Solutions to the Problem of Creating XHTML and CSS Web Pages

The solutions to the problems encountered during x...

CSS easily implements fixed-ratio block-level containers

When designing H5 layout, you will usually encoun...

Detailed explanation of JavaScript prototype chain

Table of contents 1. Constructors and instances 2...

Causes and solutions for front-end exception 502 bad gateway

Table of contents 502 bad gateway error formation...

VUE + OPENLAYERS achieves real-time positioning function

Table of contents Preface 1. Define label style 2...

How to configure whitelist access in mysql

Steps to configure whitelist access in mysql 1. L...

Vue template compilation details

Table of contents 1. parse 1.1 Rules for intercep...