Reputation: 119
I want to write a bash script that finds duplicate files.
How can I add a size option?
Upvotes: 11
Views: 16888
Reputation: 13
If you can't use *dupes for any reason and the number of files is very high, the sort+uniq approach won't perform well. In that case you could use something like this:
find . -not -empty -type f -printf "%012s" -exec md5sum {} \; | awk 'x[substr($0, 1, 44)]++'
find will create a line for each file, consisting of the file size in bytes (I used 12 positions, but YMMV) and the MD5 hash of the file (plus the name).
awk will filter the results without the need to sort them first. The 44 stands for 12 (the file size) + 32 (the length of the hash). If you need an explanation of the awk program, you can see the basics here.
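If the awk part looks cryptic, here is a minimal sketch of the same idiom on made-up input: x[key]++ evaluates to 0 the first time a key is seen, so the line is skipped, and to a non-zero value for every repeated key, so only duplicates are printed.

printf 'aaa\nbbb\naaa\n' | awk 'x[$0]++'
# prints only the second "aaa" - i.e. only the repeated lines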
Upvotes: 0
Reputation: 461
This might be a late answer, but there are much faster alternatives to fdupes now.
I have had the time to do a small test. For a folder with 54,000 files of a total size 17G, on a standard (8 vCPU/30G) Google Virtual Machine:
fdupes takes 2m 47.082s
findup takes 13.556s
jdupes takes 0.165s
However, my experience is that, if your folder is too large, the time might become very long (hours, if not days), since pairwise comparison (or sorting at best) and extremely memory-hungry operations soon become unbearably slow. Running a task like this on an entire disk is out of the question.
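For reference, the invocations were the plain recursive forms; this is only a sketch with a placeholder path, and flags may differ between versions:

jdupes -r /path/to/folder     # recurse and list sets of duplicate files
fdupes -r /path/to/folder     # same idea with the slower classic tool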
Upvotes: 4
Reputation: 922
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
This is how you'd want to do it. This code locates duplicates based on size first, then on MD5 hash. Note the use of -size, in relation to your question. Enjoy. It assumes you want to search in the current directory. If not, change the find . to be appropriate for the directory(ies) you'd like to search.
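If you'd rather not hardcode the current directory, a minimal parameterized sketch of the same pipeline (the argument handling and default are my own additions) could look like this:

#!/bin/bash
# Sketch: same size-first, then MD5 pipeline, run against an arbitrary directory.
dir="${1:-.}"    # directory to scan, defaults to the current one
find "$dir" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find "$dir" -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate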
Upvotes: 21
Reputation: 46904
Normally I use fdupes -r -S . (searching the current directory). But when I search for duplicates of a small number of very large files, fdupes takes very long to finish, as it does a full checksum of the whole file (I guess).
I've avoided that by comparing only the first 1 megabyte. It's not super-safe and you have to check whether it's really a duplicate if you want to be 100% sure. But the chance of two different videos (my case) having the same first megabyte but different further content is rather theoretical.
So I have written this script. Another trick it uses to speed things up is that it stores the resulting hash for each path in a file. I rely on the fact that the files don't change.
I paste this code into a console rather than running it as a script - for that, it would need some more work, but here you have the idea:
find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
echo -n '.'
if grep -q "$i" md5-partial.txt; then
echo -n ':'; #-e "\n$i ---- Already counted, skipping.";
continue;
fi
MD5=`dd bs=1M count=1 if="$i" status=none | md5sum`
MD5=`echo $MD5 | cut -d' ' -f1`
if grep "$MD5" md5-partial.txt; then echo -e "Duplicate: $i"; fi
echo $MD5 $i >> md5-partial.txt
done
## Show the duplicates
#sort md5-partial.txt | uniq --check-chars=32 -d -c | sort -b -n | cut -c 9-40 | xargs -I '{}' sh -c "grep '{}' md5-partial.txt && echo"
Another bash snippet which I use to determine the largest duplicate files:
## Show wasted space
sort md5-partial.txt | uniq --check-chars=32 -d -c | while IFS= read -r LINE; do
    HASH=`echo "$LINE" | cut -c 9-40`;
    FILEPATH=`echo "$LINE" | cut -c 42-`;   # don't clobber PATH
    ls -l "$FILEPATH" | cut -c 26-34
done
Both these scripts leave plenty of room for improvement, so feel free to contribute - here is the gist :)
Upvotes: 2
Reputation: 185851
Don't reinvent the wheel, use the proper command:
fdupes -r dir
See http://code.google.com/p/fdupes/ (packaged on some Linux distros)
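A couple of typical invocations, shown only as a sketch with placeholder paths (check your version's man page for the exact flags):

fdupes -r /path/to/dir        # list duplicate sets recursively
fdupes -r -S /path/to/dir     # also show the size of the duplicate files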
Upvotes: 21
Reputation: 786289
You can make use of cmp to compare files of the same name in two folders byte by byte, like this:
#!/bin/bash
folder1="$1"
folder2="$2"
log=~/log.txt
for i in "$folder1"/*; do
    filename="${i##*/}"    # just the file name; $i already contains the $folder1/ prefix
    cmp --silent "$folder1/$filename" "$folder2/$filename" && echo "$filename" >> "$log"
done
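Assuming the script above is saved as compare.sh (the name is just an example), it would be used like this:

chmod +x compare.sh
./compare.sh /path/to/folder1 /path/to/folder2
cat ~/log.txt    # names of files that are byte-identical in both folders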
Upvotes: 0
Reputation: 1055
find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\n" | sort | uniq -d
The find command looks in two folders for files, prints only the file name (stripping leading directories) and the size, then sort and uniq -d show only the duplicates. This assumes there are no newlines in the file names.
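Once a name/size pair is reported, you can list the full paths of the matching files; this is just a sketch with made-up values for the name and byte size:

find /path/to/folder1 /path/to/folder2 -type f -name "example.mp4" -size 1048576c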
Upvotes: 3