Deduplicating file contents and computing intersection, union, and difference in Linux

1. Data Deduplication

In day-to-day work, data exported from a Hive or Impala query may contain duplicate rows. Re-running the query is unattractive when it takes a long time and the exported file is large, so it is simpler to remove the duplicates from the exported file with Linux commands.

The following is an example:

In this example, aaa.txt contains 3 duplicate lines.

The goal is to remove the redundant copies and keep only one.

sort aaa.txt | uniq > bbb.txt

This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort step is required because uniq only collapses identical lines that are adjacent.

After the command runs, bbb.txt retains only one copy of each line.
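As a minimal sketch of the step above, the following builds a small aaa.txt and deduplicates it. The sample rows are made up for illustration. `sort -u` gives the same result as `sort | uniq` for this data, and the `awk` one-liner is an alternative (not part of the original workflow) that keeps the first occurrence of each line in its original order instead of sorting.

```shell
# Hypothetical sample data: "1001,tom" appears three times.
printf '%s\n' '1003,bob' '1001,tom' '1002,amy' '1001,tom' '1001,tom' > aaa.txt

# Sort so that duplicates become adjacent, then collapse them.
sort aaa.txt | uniq > bbb.txt

# Same result in a single command.
sort -u aaa.txt > bbb_sort_u.txt

# Alternative: keep the original line order (first occurrence wins).
awk '!seen[$0]++' aaa.txt > bbb_ordered.txt

cat bbb.txt
```

The sorted variants change the order of the lines, which usually does not matter for exported query results. If the order does matter, the `awk` variant is the one to reach for.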

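2. Data intersection, union, and difference

The commands below assume that each file holds one record (here, one user_no) per line and that no line is repeated within a single file.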

1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no=user_2020.user_no)

sort user_2019.txt user_2020.txt | uniq -d
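`uniq -d` prints every line that occurs more than once in the combined input, so a line repeated inside one file is reported even if the other file does not contain it. The sketch below shows that effect and a more defensive variant that deduplicates each file first. The user numbers and the output file names are hypothetical.

```shell
# Hypothetical sample data: u02 is repeated inside user_2019.txt only.
printf '%s\n' u01 u02 u02 u03 > user_2019.txt
printf '%s\n' u03 u04 > user_2020.txt

# Plain form: u02 is reported although user_2020.txt does not contain it.
sort user_2019.txt user_2020.txt | uniq -d > intersect_naive.txt

# Defensive form: deduplicate each file, then keep lines seen in both.
{ sort -u user_2019.txt; sort -u user_2020.txt; } | sort | uniq -d > intersect.txt

cat intersect.txt
```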

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq

3) Difference

user_2019.txt - user_2020.txt (lines in user_2019.txt that are not in user_2020.txt):

sort user_2019.txt user_2020.txt user_2020.txt | uniq -u
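Listing user_2020.txt twice guarantees that every line from that file occurs at least twice, so `uniq -u` (which keeps only lines that occur exactly once) can only print lines that come from user_2019.txt alone. A small sketch with hypothetical sample data makes the counts visible through `uniq -c`:

```shell
# Hypothetical sample data with no duplicates inside either file.
printf '%s\n' u01 u02 u03 > user_2019.txt
printf '%s\n' u03 u04 > user_2020.txt

# Occurrence counts when user_2020.txt is listed twice:
#   1 = only in user_2019.txt, 2 = only in user_2020.txt, 3 = in both.
sort user_2019.txt user_2020.txt user_2020.txt | uniq -c

# uniq -u keeps only the lines whose count is 1.
sort user_2019.txt user_2020.txt user_2020.txt | uniq -u > only_2019.txt

cat only_2019.txt
```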

user_2020.txt - user_2019.txt (lines in user_2020.txt that are not in user_2019.txt):

sort user_2020.txt user_2019.txt user_2019.txt | uniq -u
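As an alternative to the sort/uniq combinations above (not used in the original workflow), the standard `comm` utility compares two sorted files directly. The sketch below assumes hypothetical sample data and intermediate file names. Sorting with `sort -u` first also removes any duplicates within a file.

```shell
# Hypothetical sample data.
printf '%s\n' u01 u02 u03 > user_2019.txt
printf '%s\n' u03 u04 > user_2020.txt

# comm requires sorted input.
sort -u user_2019.txt > a.sorted
sort -u user_2020.txt > b.sorted

comm -12 a.sorted b.sorted > both.txt        # intersection
comm -23 a.sorted b.sorted > only_2019.txt   # user_2019.txt - user_2020.txt
comm -13 a.sorted b.sorted > only_2020.txt   # user_2020.txt - user_2019.txt

cat both.txt only_2019.txt only_2020.txt
```

The options name the columns to suppress: column 1 holds lines found only in the first file, column 2 lines found only in the second, and column 3 lines found in both.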
