Detailed explanation of identifying files with the same content on Linux

Preface

Sometimes file copies amount to a huge waste of hard drive space and can cause confusion when you want to update a file. Here are six commands for identifying these files.

In a recent post, we looked at how to identify and locate files that are hard linked (i.e., refer to the same hard disk content and share an inode). In this article, we will look at commands that can find files with the same content but not linked to each other.

Hard links are useful because they enable files to be stored in multiple places within the file system without taking up additional hard drive space. On the other hand, sometimes file copies can be a huge waste of hard drive space and can cause inconvenience when you want to update files. In this article, we'll look at a number of ways to identify these files.
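As a quick refresher on that earlier topic, the ln command creates a hard link, and ls -i confirms that both names share a single inode (a minimal illustration; the file name and the inode number shown here are just examples):

$ ln index.html index-link.html
$ ls -i index.html index-link.html
262148 index-link.html  262148 index.html

Because both names point at the same inode, the content is stored only once. The rest of this article deals with separate files that merely happen to hold the same content.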

Comparing files with the diff command

Probably the easiest way to compare two files is to use the diff command. The output will show the differences between your files. The < and > symbols indicate whether there are extra lines of text in the first ( < ) or second ( > ) file passed as arguments. In this example, there are extra lines of text in backup.html.

$ diff index.html backup.html
2438a2439,2441
> <pre>
> That's all there is to report.
> </pre>

If diff produces no output, the two files are identical.

$ diff home.html index.html
$

The only downside to diff is that it can compare only two files at a time, and you have to specify both of them yourself. Some of the commands in this post can find multiple duplicate files for you.
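If all you need is a yes/no answer, diff -q (quiet mode) reports only whether the files differ, and its exit status (0 when they match, 1 when they differ) makes it easy to use in scripts. A quick sketch with the same example files:

$ diff -q index.html backup.html
Files index.html and backup.html differ
$ diff -q home.html index.html && echo identical
identical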

Using checksums

The cksum (checksum) command computes a checksum for a file: a numeric value derived from the file's contents. Its output shows the checksum followed by the file's size in bytes (for example, 2819078353 228029). While checksums are not completely unique, the chance of two files with different contents producing the same checksum is extremely small.

$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, the second and third files produce the same checksum (and size), so they can be assumed to be identical.
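If that small chance of a collision is a concern, a cryptographic hash such as sha256sum can be used in the same way; two files with matching SHA-256 values can, for all practical purposes, be treated as identical:

$ sha256sum *.html

The output follows the same idea as cksum, with each file's hash printed ahead of its name (there is no size field).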

Using the find command

Although the find command has no option for locating duplicate files, it can still be used to find files by name or type and to run the cksum command on each one. For example:

$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
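The listing above still has to be scanned by eye for repeats. Adding sort and awk to the pipeline reports only the files whose checksum and size both match another file's (a sketch; adjust the find expression to suit your own tree):

$ find . -type f -exec cksum {} + |
    sort -n |
    awk '{ key = $1 FS $2; group[key] = group[key] $0 "\n"; count[key]++ }
         END { for (k in group) if (count[k] > 1) printf "%s\n", group[k] }'

Files are bucketed by checksum plus size, and only buckets holding more than one file are printed, each group separated by a blank line.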

Using the fslint command

The fslint command can be used specifically to find duplicate files. Note that we give it a starting directory, and that it can take a while to complete if it has to scan a large number of files. Note also how it lists the duplicate files while reporting other problems as well, such as empty directories and bad IDs.

$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files <==
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may need to install fslint on your system. You may also need to add it to your command search path:

$ export PATH=$PATH:/usr/share/fslint/fslint
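Since fslint is really a collection of scripts, its duplicate finder, findup, can also be run on its own when duplicates are all you care about (a sketch; invocation details may vary with your fslint version):

$ findup .

With the PATH addition above in place, findup scans the given directory and prints the sets of duplicate files it finds.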

Using the rdfind command

The rdfind command also looks for duplicate (same content) files. The name stands for "redundant data find," and the command can determine which file in each set is the original based on file dates -- useful if you choose to delete the duplicates, since it will remove the newer files.

$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list. 2 files left.
Now eliminating candidates based on first bytes: removed 0 files from list. 2 files left.
Now eliminating candidates based on last bytes: removed 0 files from list. 2 files left.
Now eliminating candidates based on sha1 checksum: removed 0 files from list. 2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt

You can run this command in dryrun mode (in other words, only report changes that might otherwise be made).

$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
(DRYRUN MODE) Removed 9 files due to unique sizes from list. 2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

The rdfind command also provides options such as ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). See the man page for a full explanation of the options below.

-ignoreempty ignore empty files
-minsize ignore files smaller than specified size
-followsymlinks follow symbolic links
-removeidentinode remove files referring to identical inode
-checksum identify checksum type to be used
-deterministic determines how to sort files
-makesymlinks turn duplicate files into symbolic links
-makehardlinks replace duplicate files with hard links
-makeresultsfile creates a results file in the current directory
-outputname provide name for results file
-deleteduplicates delete/unlink duplicate files
-sleep set sleep time between reading files (milliseconds)
-n, -dryrun display what would have been done, but don't do it

Note that the rdfind command deletes duplicates when run with the -deleteduplicates true option. Hopefully this little quirk in its option syntax (options take an explicit true or false argument) won't annoy you. ;-)

$ rdfind -deleteduplicates true .
...
Deleted 1 files. <==
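If you would rather reclaim the space without losing any of the file names, the -makehardlinks option listed above replaces each duplicate with a hard link to the remaining copy instead of deleting it (a sketch run against the same directory):

$ rdfind -makehardlinks true .

This brings us back to the hard links discussed at the start of this article: every name survives, but the content is stored only once.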

You will probably need to install the rdfind command on your system. It might be a good idea to experiment with it to get familiar with how to use it.

Using the fdupes command

The fdupes command also makes it easy to identify duplicate files, and it provides a number of useful options, such as -r for recursing into subdirectories. In this example, it groups duplicate files together like this:

$ fdupes ~
/home/shs/UPGRADE
/home/shs/mytwin

/home/shs/lp.txt
/home/shs/lp.man

/home/shs/penguin.png
/home/shs/penguin0.png
/home/shs/hideme.png

Here is an example using recursion. Note that many of these duplicate files are important (users' .bashrc and .profile files, for instance) and clearly should not be deleted.

# fdupes -r /home
/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/nemo/.profile
/home/dory/.profile
/home/shark/.profile

/home/nemo/tryme
/home/shs/tryme

/home/shs/arrow.png
/home/shs/PNGs/arrow.png

/home/shs/11/files_11.zip
/home/shs/ERIC/file_11.zip

/home/shs/penguin0.jpg
/home/shs/PNGs/penguin.jpg
/home/shs/PNGs/penguin0.jpg

/home/shs/Sandra_rotated.png
/home/shs/PNGs/Sandra_rotated.png

The fdupes command has many options, listed below; a short example combining a few of them follows the list. Use the fdupes -h command or read the man page for details.

-r --recurse recurse into subdirectories
-R --recurse: recurse into the directories specified after this option
-s --symlinks follow symlinked directories
-H --hardlinks treat hard links as duplicates
-n --noempty ignore empty files
-f --omitfirst omit the first file in each set of matches
-A --nohidden ignore hidden files
-1 --sameline list matches on a single line
-S --size show size of duplicate files
-m --summarize summarize duplicate files information
-q --quiet hide progress indicator
-d --delete prompt user for files to preserve
-N --noprompt when used with --delete, preserve the first file in set
-I --immediate delete duplicates as they are encountered
-p --permissions don't consider files with different owner/group or
         permission bits as duplicates
-o --order=WORD order files according to specification
-i --reverse reverse order while sorting
-v --version display fdupes version
-h --help displays help
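As promised above, here is a short sketch combining a few of these options: the first command only summarizes what it finds, while the second prompts you about which copy in each set to keep (the PNGs directory from the earlier output is used as an example):

$ fdupes -rm ~/PNGs
$ fdupes -rd ~/PNGs

Running the -m summary first is a safe habit, since only the -d run actually deletes files.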

The fdupes command is another one that you may need to install and use for a while before you become familiar with its many options.

Summary

Linux systems provide a good range of tools that can locate and (potentially) remove duplicate files, as well as options that let you specify where to search and what to do with the duplicates you find.

via: https://www.networkworld.com/article/3390204/how-to-identify-same-content-files-on-linux.html#tk.rss_all

Author: Sandra Henry-Stocker | Topic: lujun9972 | Translator: tomjlw | Proofreader: wxy
