Reputation: 119
I want to write a bash script that finds duplicate files.
How can I add a size option?
Upvotes: 11
Views: 16888
Reputation: 13
If you can't use *dupes for any reason and the number of files is very high, the sort+uniq approach won't perform well. In that case you could use something like this:
find . -not -empty -type f -printf "%012s" -exec md5sum {} \; | awk 'x[substr($0, 1, 44)]++'
find will create a line for each file, consisting of the file size in bytes (I used 12 positions, but YMMV) and the MD5 hash of the file (plus the name).
awk will filter the results without the need to sort them first. The 44 stands for 12 (the file size) + 32 (the length of the hash). If you need an explanation of the awk program, you can see the basics here.
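If the awk part looks cryptic, here is a minimal sketch of the same idiom on made-up input: x[key]++ evaluates to 0 the first time a key is seen, so the line is skipped, and to a non-zero value for every repeated key, so only duplicates are printed.

printf 'aaa\nbbb\naaa\n' | awk 'x[$0]++'
# prints only the second "aaa" - i.e. only the repeated lines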
Upvotes: 0
Reputation: 461
This might be a late answer, but there are much faster alternatives to fdupes now.
I have had the time to do a small test. For a folder with 54,000 files of a total size 17G, on a standard (8 vCPU/30G) Google Virtual Machine:
fdupes takes 2m 47.082s
findup takes 13.556s
jdupes takes 0.165s
However, my experience is that, if your folder is too large, the time might become very long (hours, if not days), since pairwise comparison (or sorting at best) and extremely memory-hungry operations soon become unbearably slow. Running a task like this on an entire disk is out of the question.
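For reference, the invocations were the plain recursive forms; this is only a sketch with a placeholder path, and flags may differ between versions:

jdupes -r /path/to/folder     # recurse and list sets of duplicate files
fdupes -r /path/to/folder     # same idea with the slower classic tool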
Upvotes: 4
Reputation: 922
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
This is how you'd want to do it. This code locates duplicates based on size first, then on MD5 hash. Note the use of -size, in relation to your question. Enjoy. It assumes you want to search in the current directory. If not, change the find . to be appropriate for the directory(ies) you'd like to search.
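If you'd rather not hardcode the current directory, a minimal parameterized sketch of the same pipeline (the argument handling and default are my own additions) could look like this:

#!/bin/bash
# Sketch: same size-first, then MD5 pipeline, run against an arbitrary directory.
dir="${1:-.}"    # directory to scan, defaults to the current one
find "$dir" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find "$dir" -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate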
Upvotes: 21
Reputation: 46904
Normally I use fdupes -r -S . (searching the current directory). But when I search for duplicates of a small number of very large files, fdupes takes very long to finish, as it does a full checksum of the whole file (I guess).
I've avoided that by comparing only the first 1 megabyte. It's not super-safe and you have to check whether it's really a duplicate if you want to be 100% sure. But the chance of two different videos (my case) having the same first megabyte but different further content is rather theoretical.
So I have written this script. Another trick it uses to speed things up is that it stores the resulting hash for each path in a file. I rely on the fact that the files don't change.
I paste this code into a console rather than running it as a script - for that, it would need some more work, but here you have the idea:
find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
echo -n '.'
if grep -q "$i" md5-partial.txt; then
echo -n ':'; #-e "\n$i ---- Already counted, skipping.";
continue;
fi
MD5=`dd bs=1M count=1 if="$i" status=none | md5sum`
MD5=`echo $MD5 | cut -d' ' -f1`
if grep "$MD5" md5-partial.txt; then echo -e "Duplicate: $i"; fi
echo $MD5 $i >> md5-partial.txt
done
## Show the duplicates
#sort md5-partial.txt | uniq --check-chars=32 -d -c | sort -b -n | cut -c 9-40 | xargs -I '{}' sh -c "grep '{}' md5-partial.txt && echo"
Another bash snippet which I use to determine the largest duplicate files:
## Show wasted space
sort md5-partial.txt | uniq --check-chars=32 -d -c | while IFS= read -r LINE; do
    HASH=`echo "$LINE" | cut -c 9-40`;
    FILEPATH=`echo "$LINE" | cut -c 42-`;   # don't clobber PATH
    ls -l "$FILEPATH" | cut -c 26-34
done
Both these scripts leave plenty of room for improvement, so feel free to contribute - here is the gist :)
Upvotes: 2
Reputation: 185851
Don't reinvent the wheel, use the proper command:
fdupes -r dir
See http://code.google.com/p/fdupes/ (packaged on some Linux distros)
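A couple of typical invocations, shown only as a sketch with placeholder paths (check your version's man page for the exact flags):

fdupes -r /path/to/dir        # list duplicate sets recursively
fdupes -r -S /path/to/dir     # also show the size of the duplicate files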
Upvotes: 21
Reputation: 786289
You can make use of cmp to compare files of the same name in two folders byte by byte, like this:
#!/bin/bash
folder1="$1"
folder2="$2"
log=~/log.txt
for i in "$folder1"/*; do
    filename="${i##*/}"    # just the file name; $i already contains the $folder1/ prefix
    cmp --silent "$folder1/$filename" "$folder2/$filename" && echo "$filename" >> "$log"
done
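Assuming the script above is saved as compare.sh (the name is just an example), it would be used like this:

chmod +x compare.sh
./compare.sh /path/to/folder1 /path/to/folder2
cat ~/log.txt    # names of files that are byte-identical in both folders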
Upvotes: 0
Reputation: 1055
find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\n" | sort | uniq -d
The find command looks in two folders for files, prints only the file name (stripping leading directories) and the size, then sort and uniq -d show only the duplicates. This assumes there are no newlines in the file names.
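Once a name/size pair is reported, you can list the full paths of the matching files; this is just a sketch with made-up values for the name and byte size:

find /path/to/folder1 /path/to/folder2 -type f -name "example.mp4" -size 1048576c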
Upvotes: 3