Deduplicating file contents and computing intersection, union, and difference in Linux

1. Data Deduplication

In day-to-day work, data exported from a Hive or Impala query sometimes contains duplicate rows. Re-running the query is unattractive when it takes a long time and the exported file is large, so it is often easier to remove the duplicates from the exported file with Linux commands.

The following is an example:

In this example, aaa.txt contains three duplicate lines.

The goal is to remove the redundant copies and keep only one of each.

sort aaa.txt | uniq > bbb.txt

This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort comes first because uniq only collapses identical lines that are adjacent.

After the command runs, bbb.txt contains only one copy of each line.
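The deduplication step can be tried on a small file. The sketch below is only an illustration: the sample values and the extra file ccc.txt are made up, since the original data is not shown. It also uses sort -u, which is a one-step alternative to sort | uniq.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: 1001 appears three times.
printf '%s\n' 1001 1002 1001 1003 1001 > aaa.txt

# uniq only removes adjacent duplicates, so the input is sorted first.
sort aaa.txt | uniq > bbb.txt

# sort -u sorts and deduplicates in a single command.
sort -u aaa.txt > ccc.txt

cat bbb.txt
```

Both bbb.txt and ccc.txt end up with one line per distinct value. Note that the output is in sorted order, not in the order of the original file.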

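2. Data intersection, union, and difference

The commands below assume that each file holds one user_no per line and that no file contains the same line twice. uniq -d prints the lines that occur more than once in the combined sorted input, and uniq -u prints the lines that occur exactly once.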

1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no = user_2020.user_no)

sort user_2019.txt user_2020.txt | uniq -d
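One thing to watch: uniq -d cannot tell which file a repeated line came from, so a line that is duplicated inside a single file is reported as if it were in both. The sketch below uses made-up user numbers to show the problem and two possible workarounds: deduplicating each file first, or using comm -12 on the sorted files. The file names a.sorted and b.sorted are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: u002 appears twice in user_2019.txt only.
printf '%s\n' u001 u002 u003 u002 > user_2019.txt
printf '%s\n' u003 u004 > user_2020.txt

# Naive form: u002 is wrongly reported as common to both files.
naive=$(sort user_2019.txt user_2020.txt | uniq -d)

# Workaround 1: deduplicate each file before combining them.
sort -u user_2019.txt > a.sorted
sort -u user_2020.txt > b.sorted
safe=$(sort a.sorted b.sorted | uniq -d)

# Workaround 2: comm -12 prints only lines present in both sorted files.
with_comm=$(comm -12 a.sorted b.sorted)

printf 'naive: %s\n' $naive
printf 'safe: %s\n' "$safe"
printf 'comm: %s\n' "$with_comm"
```

If the exported files are already known to be free of internal duplicates, the original one-liner is sufficient.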

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq
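As a quick check of the union command, the following sketch runs it on two small files with made-up user numbers and compares it with the shorter sort -u form, which should give the same result.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: u003 is present in both files.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u003 u004 > user_2020.txt

# Union as written in the article.
union=$(sort user_2019.txt user_2020.txt | uniq)

# Equivalent single-command form.
union_u=$(sort -u user_2019.txt user_2020.txt)

printf '%s\n' "$union"
```

Each distinct user number appears once in the output, regardless of how many files or lines it occurred in.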

3) Difference

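user_2019.txt - user_2020.txt (lines in user_2019.txt that are not in user_2020.txt):

sort user_2019.txt user_2020.txt user_2020.txt | uniq -u

user_2020.txt - user_2019.txt (lines in user_2020.txt that are not in user_2019.txt):

sort user_2020.txt user_2019.txt user_2019.txt | uniq -u

The file being subtracted is listed twice so that every one of its lines occurs at least twice in the sorted input; uniq -u then drops those lines and keeps only the lines unique to the other file.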
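The difference commands can be checked on small inputs. The sketch below uses made-up user numbers, and the file names a.sorted and b.sorted are hypothetical. It also shows comm -23 as a possible alternative, which prints the lines found only in the first of two sorted files.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: u003 is present in both files.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u003 u004 > user_2020.txt

# user_2019.txt - user_2020.txt, as in the article.
only_2019=$(sort user_2019.txt user_2020.txt user_2020.txt | uniq -u)

# user_2020.txt - user_2019.txt, as in the article.
only_2020=$(sort user_2020.txt user_2019.txt user_2019.txt | uniq -u)

# Alternative: comm -23 suppresses lines unique to the second file
# and lines common to both, leaving lines only in the first file.
sort -u user_2019.txt > a.sorted
sort -u user_2020.txt > b.sorted
only_2019_comm=$(comm -23 a.sorted b.sorted)

printf '%s\n' "$only_2019"
```

The uniq -u form reads the subtracted file twice, so for very large exports the comm variant may be worth considering; it needs each file sorted once.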

