MySQL optimization tips: analysis of duplicate removal implementation methods [millions of data]

This article walks through, by example, how to remove duplicate rows from a large MySQL table (millions of rows). It is shared for your reference; the details are as follows:

Deduplication always gives me a headache, especially when the data volume is large. After all, I am not a professional DBA, so all I can offer is the deduplication approach that works for me.

The usual first approach is to use a HAVING clause to find the duplicates and then delete them one by one. Don't ask me how to write the HAVING query that detects duplicates; you know the one (it appears below anyway). This is fine when there are only a few duplicates. But what if thousands or tens of thousands of distinct values are duplicated?

Before going further, let's look at the HAVING query itself. The original SQL is as follows:

select `name`,count(*) as count from sentence group by `name` having count>1

You can run it yourself to get a feel for it: with about 5 million rows, it is painfully slow.

Setting query speed aside for now, let's see how to use this statement to drive the actual deletion. First, what data do we need in order to delete the duplicate rows? The id, certainly, plus the filtering condition. So what is missing from the result of the SQL above? The id. Let's add that field and try again:

select id,`name`,count(*) as count from sentence group by `name` having count>1

Now three columns come back: id, name, and count; run it yourself to see the effect. Note that because id is not aggregated, MySQL returns the id of one arbitrary row per group (this relies on MySQL's traditional GROUP BY behavior); that row is the one we will keep. Based on this data, we can remove the duplicates.

The idea of the DELETE is: remove every row whose name is one of the duplicated values but whose id is not the representative id we just queried. Since there is more than one duplicated value, a program has to assemble the statement: concatenate all the kept ids with commas, and wrap each name value in quotes and join them with commas. The result looks like this:

delete from sentence where id not in(1,2,3) and name in('a','b','c')

If there is too much data, we can put this into a script and run the deletion in batches (a sketch follows). At this point we can remove the duplicates, but speed is still a problem. Next, we need to consider how to optimize this SQL to make it faster, and then we are done.
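For readers who want to script this, here is a minimal PHP sketch of the batched deletion described above. It assumes an existing PDO connection $pdo and the sentence table from the examples; I use MIN(id) as the representative id so the query also works with ONLY_FULL_GROUP_BY enabled, and the batch size is illustrative:

<?php
// Minimal sketch of batched deduplication (assumes a PDO connection $pdo).
// MIN(id) picks the row to keep for each duplicated name.
$dupes = $pdo->query(
    "SELECT MIN(id) AS id, `name`, COUNT(*) AS count
     FROM sentence
     GROUP BY `name`
     HAVING count > 1"
)->fetchAll(PDO::FETCH_ASSOC);

// Delete in batches so one huge DELETE does not lock the table for too long.
foreach (array_chunk($dupes, 500) as $batch) {
    $keepIds = array_column($batch, 'id');
    $names   = array_column($batch, 'name');

    $idIn   = implode(',', array_fill(0, count($keepIds), '?'));
    $nameIn = implode(',', array_fill(0, count($names), '?'));

    $stmt = $pdo->prepare(
        "DELETE FROM sentence
         WHERE id NOT IN ($idIn) AND `name` IN ($nameIn)"
    );
    $stmt->execute(array_merge($keepIds, $names));
}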

Since we want to speed up our SQL, the first thing that comes to mind, naturally, is an index. So, without further ado, let's create one. But on which field? That is the next question.

The principle is to pick a field that partitions your name values into smaller groups. For example, my name field stores brand names, and I also have an industry field storing each brand's industry, so I created an index on industry. There may be more suitable choices; it depends on your data. Without further ado, let's look at the optimized SQL:

select id,`name`,count(*) as count from sentence where industry = 'Drinks' group by `name` having count>1

Run it and check the execution plan: the result shows that our index is being used. As for the speed, you can see for yourself without me telling you. After that, we again let the program concatenate the kept ids with commas and quote-and-join the name values, and then carry out the deduplication. The efficiency improves dramatically.
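For reference, here is a minimal sketch of creating such an index and verifying it with EXPLAIN, following the example above (the index name idx_industry is illustrative):

alter table sentence add index idx_industry (industry);

explain select id,`name`,count(*) as count from sentence where industry = 'Drinks' group by `name` having count>1;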

However, some readers may have filter conditions involving LIKE (with a leading wildcard) or other patterns that defeat the index. In that case, we can simply partition the data into categories, query each category separately, and once all the data is in hand, let the program detect the duplicates and collect the rows to delete. A rough sketch of this idea follows.
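As an illustration, here is a PHP sketch of the per-category approach (again assuming a PDO connection $pdo and the sentence table; it also assumes duplicate names never span categories, as with brands grouped by industry):

<?php
// Sketch: query each category separately, find duplicates in the program.
// Assumes a PDO connection $pdo and that duplicate names do not
// span categories (true for brands grouped by industry).
$categories = $pdo->query("SELECT DISTINCT industry FROM sentence")
                  ->fetchAll(PDO::FETCH_COLUMN);

foreach ($categories as $industry) {
    $stmt = $pdo->prepare("SELECT id, `name` FROM sentence WHERE industry = ?");
    $stmt->execute([$industry]);

    $seen = [];     // name => id of the first row seen (the one we keep)
    $toDelete = []; // ids of later duplicates
    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        if (isset($seen[$row['name']])) {
            $toDelete[] = $row['id'];
        } else {
            $seen[$row['name']] = $row['id'];
        }
    }

    if ($toDelete) {
        $in = implode(',', array_fill(0, count($toDelete), '?'));
        $pdo->prepare("DELETE FROM sentence WHERE id IN ($in)")
            ->execute($toDelete);
    }
}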

Appendix: optimizing MySQL random-row (RAND()) queries

When it comes to picking random rows, everyone reaches for the rand() function (ORDER BY RAND()), but once the data volume grows, you can imagine the query efficiency. So what do we do? Let's optimize this query.

My optimization method is to do the randomization in the program and then fetch the row with LIMIT. Here is the general idea.

First, count how many rows match the condition; then use PHP's rand() function to pick a random value in that range; then query the row directly at that offset.

Example SQL (with the PHP step in between):

select count(*) from test where $where; -- count the rows matching the condition

$offset = rand(0, $count - 1); // PHP: pick a zero-based offset ($count - 1 keeps LIMIT in range)

select * from test where $where limit $offset,1; -- fetch the required row
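Putting the three steps together, a minimal PHP sketch (assuming a PDO connection $pdo and the test table from the pseudocode; the WHERE condition is illustrative):

<?php
// Random row without ORDER BY RAND(): count, pick an offset, fetch.
// Assumes a PDO connection $pdo; table and condition are illustrative.
$where = "status = 1"; // example filter; bind real user input properly

// 1. Count the matching rows.
$count = (int) $pdo->query("SELECT COUNT(*) FROM test WHERE $where")
                   ->fetchColumn();

if ($count > 0) {
    // 2. Zero-based offset; $count - 1 keeps LIMIT in range.
    $offset = rand(0, $count - 1);

    // 3. Fetch exactly one row at that offset ($offset is an int, safe to inline).
    $row = $pdo->query("SELECT * FROM test WHERE $where LIMIT $offset, 1")
               ->fetch(PDO::FETCH_ASSOC);
}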

Try it yourself: with about five million rows, this is at least ten times faster than the ORDER BY RAND() approach.

Okay, that’s all for this sharing.


I hope this article is helpful to everyone working with MySQL databases.
