Summary of SQL deduplication methods

Summary of SQL deduplication methods

When using SQL to extract data, we often encounter duplicate values ​​in the table. For example, if we want to get UV (unique visitors), we need to deduplicate.

In MySQL, distinct or group by clause is usually used, but in SQLs that support window functions (such as Hive SQL , Oracle , etc.), the ROW_NUMBER window function can also be used for deduplication.

For example, there is a table task like this:

Remark:

  • task_id : task id;
  • order_id : order id;
  • start_time : start time

Note : One task corresponds to multiple orders

We need to find the total number of tasks. Since task_id is not unique, we need to remove duplicates:

distinct

 -- List all unique values ​​of task_id (after deduplication)

select distinct task_id
from Task;

--Total number of tasks select count(distinct task_id) task_num
from Task;


distinct is usually less efficient. It is not suitable for displaying specific values ​​after deduplication, and is generally used together with count to calculate the number of entries.
When distinct is used, it is placed after select to deduplicate the values ​​of all the subsequent fields. For example, if there are two fields after distinct , then the two records 1,1 and 1,2 are not duplicate values.

group by

 -- List all unique values ​​of task_id (after deduplication, null is also a value)
-- select task_id
-- from Task
-- group by task_id;

--Total number of tasks select count(task_id) task_num
from (select task_id
   from Task
   group by task_id) tmp;

row_number

row_number is a window function with the following syntax:

row_number() over (partition by <用于分組的字段名> order by <用于組內排序的字段名>)
partition by part can be omitted.

 -- Use select count(case when rn=1 then task_id else null end) task_num in SQL that supports window functions
from (select task_id
    , row_number() over (partition by task_id order by start_time) rn
  from Task) tmp;

In addition, let's use a table test to explain the use of distinct and group by in deduplication:

 -- The semicolon below is used to separate rows select distinct user_id
from Test; -- returns 1; 2

select distinct user_id, user_type
from Test; -- returns 1, 1; 1, 2; 2, 1

select user_id
from Test
group by user_id; -- returns 1; 2

select user_id, user_type
from Test
group by user_id, user_type; -- returns 1, 1; 1, 2; 2, 1

select user_id, user_type
from Test
group by user_id; 
  -- Hive, Oracle, etc. will report an error, but MySQL can be written like this.
-- Returns 1, 1 or 1, 2; 2, 1 (two rows in total). Only the fields after group by will be deduplicated, which means the number of records returned at the end is equal to the number of records in the previous SQL statement, that is, 2 records. For fields that are not placed after group by but are placed in select, only one record will be returned (usually the first one, but there should be no pattern).

This is the end of this article on the summary of SQL deduplication methods. For more relevant SQL deduplication methods, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • Summary of three deduplication methods in SQL
  • Detailed example of using the distinct method in MySQL
  • How to optimize MySQL deduplication operation to the extreme
  • A simple method to merge and remove duplicate MySQL tables
  • MySQL deduplication methods
  • Detailed explanation of two methods of deduplication in MySQL and example code
  • SQL Learning Notes 5: How to remove duplicates and assign values ​​to newly added fields

<<:  Share 10 of the latest web front-end frameworks (translation)

>>:  Pure CSS to achieve hover image pop-out pop-up effect example code

Recommend

Solution to slow network request in docker container

Several problems were discovered during the use o...

Introduction to the usage of exists and except in SQL Server

Table of contents 1. exists 1.1 Description 1.2 E...

How to manually scroll logs in Linux system

Log rotation is a very common function on Linux s...

Analysis of slow insert cases caused by large transactions in MySQL

【question】 The INSERT statement is one of the mos...

Solve MySQL deadlock routine by updating different indexes

The previous articles introduced how to debug loc...

MySQL 8.0.12 installation and configuration method graphic tutorial

Record the installation and configuration method ...

How to use iostat to view Linux hard disk IO performance

TOP Observation: The percentage of CPU time occup...

Using JS to implement a small game of aircraft war

This article example shares the specific code of ...

Docker realizes the connection with the same IP network segment

Recently, I solved the problem of Docker and the ...

MySQL series 9 MySQL query cache and index

Table of contents Tutorial Series 1. MySQL Archit...

A brief discussion on Flex layout and scaling calculation

1. Introduction to Flex Layout Flex is the abbrev...

Detailed steps to install Hadoop cluster under Linux

Table of contents 1. Create a Hadoop directory in...

Detailed tutorial for installing MySQL on Linux

MySQL downloads for all platforms are available a...