Reputation: 3930
I need to find every duplicate filename in a given directory tree. I don't know what directory tree the user will give as a script argument, so I don't know the directory hierarchy. I tried this:
#!/bin/sh
find -type f | while IFS= read vo
do
echo `basename "$vo"`
done
but that's not really what I want. It finds only one duplicate and then ends, even if there are more duplicate filenames. Also, it doesn't print the whole path (only the filename) or a duplicate count. I wanted to do something similar to this command:
find DIRNAME | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "
but it doesn't work for me; I don't know why. Even if I have duplicates, it prints nothing.
Upvotes: 27
Views: 46544
Reputation: 1127
All those loops and temporary files in other answers seem a bit cumbersome.
find /PATH/TO/FILES -type f -printf '%p/ %f\n' | sort -k2 | uniq -f1 --all-repeated=separate
It has its limitations due to uniq and sort (uniq doesn't support comparing only one field and is inflexible with field delimiters), but it is quite flexible regarding its output thanks to find -printf, and it works well for me. It also seems to be what @yak tried to achieve originally.
Demonstrating some of the options you have with this:
find /PATH/TO/FILES -type f -printf 'size: %s bytes, modified at: %t, path: %h/, file name: %f\n' | sort -k15 | uniq -f14 --all-repeated=prepend
Also, there are options in sort and uniq to ignore case (as the topic opener intended to achieve by piping through tr). Look them up using man uniq or man sort.
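For example, a case-insensitive variant of the first command could look like this (an untested sketch; -f and -i are the GNU ignore-case options for sort and uniq respectively):
find /PATH/TO/FILES -type f -printf '%p/ %f\n' | sort -f -k2 | uniq -f1 -i --all-repeated=separate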
Upvotes: 27
Reputation: 11
Perhaps the easiest solution is to use cksfv with the -R switch, which recursively calculates a CRC for every file located under a specific path. We can then rely on the awk, sort, and cut commands to come up with a clean results file.
The first step would be to run the following: cksfv -R /parent/path/ > output.txt
The output of cksfv will look like this:
/parent/path/folder1/file1 E113452E
/parent/path/folder1/file2 GE133453
/parent/path/folder2/file1 A441292E
etc
Once the results are in output.txt, we can rely on the calculated CRC32 values at the end of every line to find any duplicates, using this chain of commands:
awk '{print $NF,$0}' output.txt | sort | cut -f2- -d' '
At this point we should have all lines/files sorted by their CRC32 value, alongside the path where each file is located. The final step is to clean up using whichever method we choose.
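If you only want to see the duplicate groups, one possible follow-up step (a sketch, assuming the checksum is the last whitespace-separated field, as in the sample output above):
awk '{print $NF}' output.txt | sort | uniq -d |
while read -r crc
do
    # print every file sharing a repeated checksum
    grep " $crc$" output.txt
done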
Of course, deploying the cksfv package on any modern Debian/Ubuntu system can be done by invoking sudo apt install cksfv
A more modern (read: faster) approach is to deploy cksfv-rs, via Rust or by visiting https://github.com/althonos/cksfv.rs
You can learn more about cksfv and cksfv-rs by reading my article in Linux Magazine at https://www.linux-magazine.com/Issues/2024/287/cksfv
Cheers!
Upvotes: 1
Reputation: 11
Just stumbled upon this interesting case lately. Sharing my solution here even though the question is long outdated.
#!/bin/sh
list=$(mktemp)
tab=$(printf '\t')
# %f = bare filename, %p = full path, tab-separated
find PATH/TO/DIR/ -type f -printf '%f\t%p\n' | sort -f > "$list"
# names occurring more than once (case-insensitively), joined back to their full paths
cut -d"$tab" -f1 < "$list" | uniq -d -i | join -i -t"$tab" - "$list"
rm "$list"
A quick demonstration. Directory tree:
a/f1
a/f2
a/f3
b/f2
c/f2
c/f3
Output:
f2 a/f2
f2 b/f2
f2 c/f2
f3 a/f3
f3 c/f3
Upvotes: 1
Reputation: 2000
Here is another solution (based on the suggestion by @jim-mcnamara) without awk:
Solution 1
#!/bin/sh
dirname=/path/to/directory
find "$dirname" -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find "$dirname" -type f | grep "/$fileName$"
done
However, this does the same search twice, which can become very slow if you have a lot of data to search. Saving the find results in a temporary file might give better performance.
Solution 2 (with temporary file)
#!/bin/sh
dirname=/path/to/directory
tempfile=$(mktemp) || exit 1
find "$dirname" -type f > "$tempfile"
sed 's_.*/__' "$tempfile" | sort | uniq -d |
while read fileName
do
    grep "/$fileName$" "$tempfile"
done
rm -f "$tempfile"
Since in some cases you might not want to write a temp file to the hard drive, you can choose the method that fits your needs. Both examples print the full path of the file.
Bonus question here: Is it possible to save the whole output of the find command as a list to a variable?
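As for the bonus question: yes, with command substitution; a minimal sketch (it breaks for filenames that contain newlines):
lst=$(find "$dirname" -type f)   # the substitution strips the trailing newline
echo "$lst" | wc -l              # quote the variable to preserve inner newlines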
Upvotes: 27
Reputation: 151
Here is my contribution (it only searches for a specific file type, PDFs in this case), but it does so recursively:
#!/usr/bin/env bash
find . -type f | while IFS= read -r filename; do
    filename=$(basename -- "$filename")
    extension="${filename##*.}"
    if [[ $extension == "pdf" ]]; then
        fileNameCount=$(find . -iname "$filename" | wc -l)
        if [[ $fileNameCount -gt 1 ]]; then
            echo "File Name: $filename, count: $fileNameCount"
        fi
    fi
done
Upvotes: 0
Reputation: 464
One "find" command only:
lst=$( find . -type f )
echo "$lst" | rev | cut -f 1 -d/ | rev | sort -f | uniq -i | while read f; do
names=$( echo "$lst" | grep -i -- "/$f$" )
n=$( echo "$names" | wc -l )
[ $n -gt 1 ] && echo -e "Duplicates found ($n):\n$names"
done
Upvotes: 2
Reputation: 877
This solution writes one temporary file to a temporary directory for every unique filename found. In the temporary file I record the path where I first saw the filename, so that I can output it later. So I create a lot more files than the other posted solutions, but it was something I could understand.
Following is the script, named fndupe.
#!/bin/bash
# Create a temp directory to contain placeholder files.
tmp_dir=$(mktemp -d)
# Get paths of files to test from standard input.
while IFS= read -r p; do
    fname=$(basename "$p")
    tmp_path=$tmp_dir/$fname
    if [[ -e $tmp_path ]]; then
        # Seen this name before; the placeholder file holds the first path.
        q=$(cat "$tmp_path")
        echo "duplicate: $p"
        echo " first: $q"
    else
        # First occurrence: record where we saw it.
        echo "$p" > "$tmp_path"
    fi
done
# Remove the placeholder files.
rm -r "$tmp_dir"
exit
Following is an example of using the script.
$ find . -name '*.tif' | fndupe
Following is example output when the script finds duplicate filenames.
duplicate: a/b/extra/gobble.tif
first: a/b/gobble.tif
Tested with Bash version: GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Upvotes: 0
Reputation: 602
#!/bin/bash
file=$(mktemp /tmp/duplicates.XXXXX) || { echo "Error creating tmp file"; exit 1; }
find "$1" -type f | sort > "$file"
awk -F/ '{print tolower($NF)}' "$file" |
sort |
uniq -c |
awk '$1>1 { sub(/^[[:space:]]+[[:digit:]]+[[:space:]]+/,""); print }' |
while read -r line
do grep -i "/$line$" "$file"
done
rm "$file"
It also works with spaces in filenames. Here is a simple test (the first argument is the directory):
./duplicates.sh ./test
./test/2/INC 255286
./test/INC 255286
Upvotes: 2
Reputation: 16379
#!/bin/sh
dirname=/path/to/check
find "$dirname" -type f |
while IFS= read -r vo
do
    basename "$vo"
done | awk '{arr[$0]++} END{for (i in arr){if(arr[i]>1){print i}}}'
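The shell loop can also be folded into awk itself; a roughly equivalent sketch, letting awk strip the directory part:
find "$dirname" -type f | awk -F/ '{count[$NF]++} END{for (name in count) if (count[name] > 1) print name}'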
Upvotes: 8