Summary of SQL deduplication methods

Summary of SQL deduplication methods

When using SQL to extract data, we often encounter duplicate values ​​in the table. For example, if we want to get UV (unique visitors), we need to deduplicate.

In MySQL, distinct or group by clause is usually used, but in SQLs that support window functions (such as Hive SQL , Oracle , etc.), the ROW_NUMBER window function can also be used for deduplication.

For example, there is a table task like this:

Remark:

  • task_id : task id;
  • order_id : order id;
  • start_time : start time

Note : One task corresponds to multiple orders

We need to find the total number of tasks. Since task_id is not unique, we need to remove duplicates:

distinct

 -- List all unique values ​​of task_id (after deduplication)

select distinct task_id
from Task;

--Total number of tasks select count(distinct task_id) task_num
from Task;


distinct is usually less efficient. It is not suitable for displaying specific values ​​after deduplication, and is generally used together with count to calculate the number of entries.
When distinct is used, it is placed after select to deduplicate the values ​​of all the subsequent fields. For example, if there are two fields after distinct , then the two records 1,1 and 1,2 are not duplicate values.

group by

 -- List all unique values ​​of task_id (after deduplication, null is also a value)
-- select task_id
-- from Task
-- group by task_id;

--Total number of tasks select count(task_id) task_num
from (select task_id
   from Task
   group by task_id) tmp;

row_number

row_number is a window function with the following syntax:

row_number() over (partition by <用于分組的字段名> order by <用于組內排序的字段名>)
partition by part can be omitted.

 -- Use select count(case when rn=1 then task_id else null end) task_num in SQL that supports window functions
from (select task_id
    , row_number() over (partition by task_id order by start_time) rn
  from Task) tmp;

In addition, let's use a table test to explain the use of distinct and group by in deduplication:

 -- The semicolon below is used to separate rows select distinct user_id
from Test; -- returns 1; 2

select distinct user_id, user_type
from Test; -- returns 1, 1; 1, 2; 2, 1

select user_id
from Test
group by user_id; -- returns 1; 2

select user_id, user_type
from Test
group by user_id, user_type; -- returns 1, 1; 1, 2; 2, 1

select user_id, user_type
from Test
group by user_id; 
  -- Hive, Oracle, etc. will report an error, but MySQL can be written like this.
-- Returns 1, 1 or 1, 2; 2, 1 (two rows in total). Only the fields after group by will be deduplicated, which means the number of records returned at the end is equal to the number of records in the previous SQL statement, that is, 2 records. For fields that are not placed after group by but are placed in select, only one record will be returned (usually the first one, but there should be no pattern).

This is the end of this article on the summary of SQL deduplication methods. For more relevant SQL deduplication methods, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • Summary of three deduplication methods in SQL
  • Detailed example of using the distinct method in MySQL
  • How to optimize MySQL deduplication operation to the extreme
  • A simple method to merge and remove duplicate MySQL tables
  • MySQL deduplication methods
  • Detailed explanation of two methods of deduplication in MySQL and example code
  • SQL Learning Notes 5: How to remove duplicates and assign values ​​to newly added fields

<<:  Share 10 of the latest web front-end frameworks (translation)

>>:  Pure CSS to achieve hover image pop-out pop-up effect example code

Recommend

Detailed explanation of Linux mpstat command usage

1. mpstat command 1.1 Command Format mpstat [ -A ...

Solution to the gap between divs

When you use HTML div blocks and the middle of th...

Troubleshooting and solutions for MySQL auto-increment ID oversize problem

introduction Xiao A was writing code, and DBA Xia...

Vue3 implements CSS infinite seamless scrolling effect

This article example shares the specific code of ...

Solution to the problem of repeated triggering of functions in Vue project watch

Table of contents Problem description: Solution 1...

Detailed steps for using AES.js in Vue

Use of AES encryption Data transmission encryptio...

How to set a fixed IP in Linux (tested and effective)

First, open the virtual machine Open xshell5 to c...

The whole process of installing gogs with pagoda panel and docker

Table of contents 1 Install Docker in Baota Softw...

Nginx defines domain name access method

I'm building Nginx recently, but I can't ...

How to use less in WeChat applet (optimal method)

Preface I am used to writing less/sass, but now I...

Docker container custom hosts network access operation

Adding the extra_hosts keyword in docker-compose....

CSS flex several multi-column layout

Basic three-column layout .container{ display: fl...

8 powerful techniques for HTML web page creation

<br />Although there are many web page creat...