Deduplicating file contents and computing intersection, union, and difference in Linux

1. Data Deduplication

In day-to-day work, a result set exported from a Hive or Impala query sometimes contains duplicate rows. When the query is slow and the exported file is large, re-running it is unattractive, and it is simpler to remove the duplicate lines from the exported file with Linux commands.

The following is an example:

Here aaa.txt contains three duplicate lines.

The goal is to remove the redundant copies and keep only one.

sort aaa.txt | uniq > bbb.txt

This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort step is required because uniq only collapses identical lines that are adjacent.

Only one copy of the data is retained in bbb.txt.
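The steps above can be tried end to end with a small sample. This is a minimal sketch: the contents of aaa.txt are hypothetical (the original data is not shown), the scratch directory and the file ccc.txt are only there for illustration, and LC_ALL=C is set so the sort order is the same on every machine.

```shell
export LC_ALL=C
workdir=$(mktemp -d)          # scratch directory, so no real files are touched
cd "$workdir" || exit 1

# Hypothetical sample data: the line "1001,alice" appears three times.
printf '%s\n' '1001,alice' '1002,bob' '1001,alice' '1003,carol' '1001,alice' > aaa.txt

# Show how often each line occurs before deduplication.
sort aaa.txt | uniq -c

# The command from the article: sort first, then collapse adjacent duplicates.
sort aaa.txt | uniq > bbb.txt

# sort -u gives the same result in one step for plain whole-line comparison.
sort -u aaa.txt > ccc.txt

# Count lines before and after (an empty pattern matches every line).
before=$(grep -c '' aaa.txt)
after=$(grep -c '' bbb.txt)
echo "lines before: $before, after: $after"
```

Note that the output is sorted, so the original line order of the export is not preserved.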

2. Data intersection, union, and difference

1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no = user_2020.user_no, assuming each file holds one user_no per line with no repeats within a file)

sort user_2019.txt user_2020.txt | uniq -d
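A short sketch of why this works, and of its one pitfall. The user numbers below are hypothetical, and the file names dirty_2019.txt, d19.sorted, and u20.sorted are made up for the example. A line repeated inside a single file would also be reported by uniq -d, so deduplicating each file first is a reasonable precaution when the input is not known to be clean.

```shell
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user_no lists, one value per line.
printf '%s\n' 1001 1002 1003 > user_2019.txt
printf '%s\n' 1002 1003 1004 > user_2020.txt

# A line present in both files occurs twice in the merged, sorted stream,
# and uniq -d prints only lines that are repeated.
both=$(sort user_2019.txt user_2020.txt | uniq -d)
echo "$both"

# Pitfall: 1001 is repeated inside one file but is absent from the other.
printf '%s\n' 1001 1001 1002 > dirty_2019.txt
naive=$(sort dirty_2019.txt user_2020.txt | uniq -d)
echo "$naive"                 # wrongly includes 1001

# Deduplicate each file first, then intersect.
sort -u dirty_2019.txt > d19.sorted
sort -u user_2020.txt > u20.sorted
safe=$(sort d19.sorted u20.sorted | uniq -d)
echo "$safe"
```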

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq
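As a quick check, the union can be run on the same kind of hypothetical sample (the user numbers are invented for the example). Like SQL union, as opposed to union all, each distinct line is kept once regardless of which file it came from.

```shell
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user_no lists, one value per line.
printf '%s\n' 1001 1002 1003 > user_2019.txt
printf '%s\n' 1002 1003 1004 > user_2020.txt

# Merge both files, sort, and keep one copy of every distinct line.
union=$(sort user_2019.txt user_2020.txt | uniq)
echo "$union"

# sort -u accepts several input files and yields the same set here.
union_u=$(sort -u user_2019.txt user_2020.txt)
```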

3) Difference

user_2019.txt - user_2020.txt (lines in user_2019.txt that are not in user_2020.txt):

sort user_2019.txt user_2020.txt user_2020.txt | uniq -u

user_2020.txt - user_2019.txt (lines in user_2020.txt that are not in user_2019.txt):

sort user_2020.txt user_2019.txt user_2019.txt | uniq -u
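The trick in both commands is that the file being subtracted is listed twice, so each of its lines occurs at least twice and uniq -u (print only lines that are not repeated) discards them. The sketch below uses hypothetical user numbers and assumes the first file has no repeated lines of its own, since those would be discarded as well. The comm cross-check is an addition for comparison, not part of the original commands, and u19.sorted and u20.sorted are illustrative file names.

```shell
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user_no lists, one value per line, no repeats within a file.
printf '%s\n' 1001 1002 1003 > user_2019.txt
printf '%s\n' 1002 1003 1004 > user_2020.txt

# user_2019 minus user_2020: every user_2020 line is doubled and dropped.
only_2019=$(sort user_2019.txt user_2020.txt user_2020.txt | uniq -u)
echo "$only_2019"

# user_2020 minus user_2019: same idea with the roles swapped.
only_2020=$(sort user_2020.txt user_2019.txt user_2019.txt | uniq -u)
echo "$only_2020"

# Cross-check with comm, which compares two sorted files directly:
# -2 hides lines only in the second file, -3 hides lines common to both.
sort user_2019.txt > u19.sorted
sort user_2020.txt > u20.sorted
comm_only_2019=$(comm -23 u19.sorted u20.sorted)
echo "$comm_only_2019"
```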
