Deduplicating file contents and computing intersection, union, and difference in Linux

1. Data Deduplication

In daily work, data exported from a Hive or Impala query sometimes contains duplicate rows. Re-running the query is unattractive when it takes a long time and the exported file is large, so it is easier to remove the duplicates from the file itself with Linux commands.

The following is an example. Suppose aaa.txt contains the same line three times, and we want to remove the redundant copies and keep only one:

sort aaa.txt | uniq > bbb.txt

This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort step is required because uniq only collapses duplicate lines that are adjacent.

Afterwards, bbb.txt retains only one copy of each line.
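As a minimal sketch of the deduplication step, the commands below build a small aaa.txt and run the same pipeline on it. The sample rows, the scratch directory, and the extra file ccc.txt are assumptions made up for illustration and do not come from the original export.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample data: the line "1001,alice" appears three times.
printf '%s\n' '1001,alice' '1002,bob' '1001,alice' '1003,carol' '1001,alice' > aaa.txt

# uniq only collapses adjacent duplicates, so the input is sorted first.
sort aaa.txt | uniq > bbb.txt

# sort -u produces the same result in a single step.
sort -u aaa.txt > ccc.txt

# Optional check before deduplicating: how often does each line occur?
sort aaa.txt | uniq -c | sort -rn

cat bbb.txt
```

Note that the output is in sorted order, not in the original row order of the export.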

2. Data intersection, union, and difference

1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no=user_2020.user_no, provided neither file contains duplicate lines of its own)

sort user_2019.txt user_2020.txt | uniq -d
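The uniq -d form reports any line that occurs more than once in the combined input, so a line repeated inside a single file is also reported. The sketch below shows that effect and one possible workaround: deduplicate each file with sort -u, then compare them with comm -12. The sample user numbers and the intermediate file names are hypothetical.

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample data, one user_no per line.
# u001 is deliberately repeated inside the 2019 file.
printf '%s\n' u001 u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u003 u004 > user_2020.txt

# Naive form: u001 is reported although it is missing from user_2020.txt,
# because both of its copies come from the same file.
sort user_2019.txt user_2020.txt | uniq -d > naive.txt

# Deduplicate each file first, then keep only the lines present in both.
# comm expects sorted input; -12 suppresses the "only in file 1/2" columns.
sort -u user_2019.txt > 2019.sorted
sort -u user_2020.txt > 2020.sorted
comm -12 2019.sorted 2020.sorted > both.txt

cat both.txt
```

If the exports are already known to be free of internal duplicates, the shorter sort | uniq -d form gives the same answer.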

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq
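A short sketch of the union on made-up data follows. The user numbers and the output file names union.txt and union_u.txt are assumptions for illustration only.

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample data, one user_no per line.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u003 u004 > user_2020.txt

# Every distinct line from either file, each kept once.
sort user_2019.txt user_2020.txt | uniq > union.txt

# One-step equivalent using sort -u.
sort -u user_2019.txt user_2020.txt > union_u.txt

# Number of distinct users across both years.
wc -l < union.txt
```

Unlike the intersection and difference forms, the union is not affected by duplicate lines inside a single file, since every line is collapsed to one copy anyway.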

3) Difference

user_2019.txt - user_2020.txt (lines that appear only in user_2019.txt):
sort user_2019.txt user_2020.txt user_2020.txt | uniq -u

user_2020.txt - user_2019.txt (lines that appear only in user_2020.txt):
sort user_2020.txt user_2019.txt user_2019.txt | uniq -u
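The file being subtracted is listed twice so that each of its lines occurs at least twice in the sorted stream; uniq -u then drops all of them and leaves only lines that occur exactly once, which can only come from the first file. The sketch below runs both directions on hypothetical data and, as an alternative, computes the same difference with comm -23. The sample user numbers and output file names are assumptions.

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample data, one user_no per line, no repeats within a file.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u003 u004 > user_2020.txt

# user_2019.txt - user_2020.txt
sort user_2019.txt user_2020.txt user_2020.txt | uniq -u > only_2019.txt

# user_2020.txt - user_2019.txt
sort user_2020.txt user_2019.txt user_2019.txt | uniq -u > only_2020.txt

# Alternative: comm -23 keeps the lines found only in the first file.
# comm expects sorted input, so both files are sorted (and deduplicated) first.
sort -u user_2019.txt > 2019.sorted
sort -u user_2020.txt > 2020.sorted
comm -23 2019.sorted 2020.sorted > only_2019_comm.txt

cat only_2019.txt only_2020.txt
```

The uniq -u form assumes the first file has no internal duplicates; a line repeated there would be dropped even though it is absent from the other file. The comm variant avoids this because each file is deduplicated beforehand.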
