Deduplicating file contents and computing intersection, union, and difference in Linux

1. Data Deduplication

In daily work, data exported from a Hive or Impala query sometimes contains duplicate rows. Re-running the query is unattractive when it takes a long time and the exported file is large, so it is easier to remove the duplicates from the file with Linux commands.

The following is an example:

In this example, aaa.txt contains three duplicate lines.

The goal is to remove the redundant copies and keep only one.

sort aaa.txt | uniq > bbb.txt

This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort is required because uniq only collapses identical lines that are adjacent.

Afterwards, bbb.txt retains only one copy of the duplicated line.
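The deduplication step can be tried on a small sample. The following is a minimal sketch, not the original data: the file contents and the extra output names ccc.txt and ddd.txt are made up for illustration. It also shows two common alternatives: sort -u, which sorts and deduplicates in one step, and an awk one-liner that keeps the first occurrence of each line without changing the original order.

```shell
# Hypothetical sample data standing in for aaa.txt (1001 appears three times)
workdir=$(mktemp -d)
printf '%s\n' 1001 1002 1001 1003 1001 > "$workdir/aaa.txt"

# The approach from the article: sort so duplicates become adjacent, then uniq
sort "$workdir/aaa.txt" | uniq > "$workdir/bbb.txt"

# Alternative: sort -u sorts and removes duplicates in one step
sort -u "$workdir/aaa.txt" > "$workdir/ccc.txt"

# Alternative: awk keeps the first occurrence of each line and preserves input order
awk '!seen[$0]++' "$workdir/aaa.txt" > "$workdir/ddd.txt"

dedup_sorted=$(cat "$workdir/bbb.txt")
dedup_sort_u=$(cat "$workdir/ccc.txt")
dedup_ordered=$(cat "$workdir/ddd.txt")
echo "$dedup_sorted"
```

The awk variant is worth considering when the order of the exported rows matters, since sorting rearranges the file.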

2. Data intersection, union, and difference

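1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no=user_2020.user_no). uniq -d prints only lines that occur more than once, so this assumes neither file contains duplicate lines of its own.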

sort user_2019.txt user_2020.txt | uniq -d

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq

3) Difference

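user_2019.txt - user_2020.txt (lines that appear only in user_2019.txt). Listing user_2020.txt twice guarantees that each of its lines occurs at least twice, so uniq -u, which prints only lines that occur exactly once, drops them. This assumes user_2019.txt has no duplicate lines of its own: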

sort user_2019.txt user_2020.txt user_2020.txt | uniq -u

user_2020.txt - user_2019.txt (lines that appear only in user_2020.txt):

sort user_2020.txt user_2019.txt user_2019.txt | uniq -u
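The set operations above can be checked on two small files. This is a sketch under stated assumptions: the user numbers and the intermediate names a.txt and b.txt are hypothetical. Because the uniq -d and uniq -u tricks rely on each file having no duplicate lines of its own, the sketch first deduplicates each input with sort -u. It also shows comm, a standard tool that compares two sorted files, as a cross-check for the difference.

```shell
# Hypothetical sample data; u02 is deliberately duplicated inside user_2019.txt
workdir=$(mktemp -d)
printf '%s\n' u01 u02 u03 u02 > "$workdir/user_2019.txt"
printf '%s\n' u02 u03 u04 > "$workdir/user_2020.txt"

# Deduplicate each file first so the uniq -d / uniq -u counting stays valid
sort -u "$workdir/user_2019.txt" > "$workdir/a.txt"
sort -u "$workdir/user_2020.txt" > "$workdir/b.txt"

# Intersection: lines present in both files occur twice in the merged stream
intersection=$(sort "$workdir/a.txt" "$workdir/b.txt" | uniq -d)

# Union: every distinct line from either file
union=$(sort "$workdir/a.txt" "$workdir/b.txt" | uniq)

# Difference: listing the subtracted file twice makes all of its lines non-unique
only_2019=$(sort "$workdir/a.txt" "$workdir/b.txt" "$workdir/b.txt" | uniq -u)
only_2020=$(sort "$workdir/b.txt" "$workdir/a.txt" "$workdir/a.txt" | uniq -u)

# Cross-check with comm: -23 suppresses lines unique to file 2 and lines common to both
comm_only_2019=$(comm -23 "$workdir/a.txt" "$workdir/b.txt")

echo "intersection: $intersection"
echo "only in 2019: $only_2019"
```

Without the sort -u step, the duplicated u02 in user_2019.txt would still be reported correctly here, but a line duplicated in only one file would wrongly show up in the intersection.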
