Reputation: 21353
I have 500 files in a directory, named prime-0.png to prime-499.png, and some of them are duplicates. I can see that they are duplicates by running, for example, md5sum on them.
How can I delete the duplicate files so I am left with only one copy of each?
md5sum prime-*
gives me:
ed8c994d608ba2fde59e6a08c65bcc1f  prime-0.png
e7667b2c92359d23cd1cd251e54b41ba  prime-100.png
0afc9b57206cde58ff609a6476bde7a6  prime-101.png
[...]
I can see how many are duplicated by doing:
md5sum prime-* |cut -f1 -d\ |sort|uniq -c|sort -rn|less
which gives me:
5 f905fde6abfbcbb00e079dcd4ecacbb7
3 efcdd042802fc0efc6d9fdf164df4e20
3 ed5a46d250c85809b57ee96385f655d2
3 c4cff53df13b87381b2c06538c339790
[...]
Upvotes: 0
Views: 1234
Reputation: 88959
This answer is only suitable for file names without line breaks.
awk prints every line whose first column (the checksum) has already been seen, i.e. the duplicates:
md5sum prime-* | awk 'n[$1]++' | cut -d " " -f 3- | xargs -I {} echo rm {}
If the output looks fine, remove echo.
From man xargs:
-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items;
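To see what the awk filter keeps, here is a minimal sketch on made-up data. The function name list_extra_copies and the checksums are illustrative assumptions, not part of the answer above; the input is assumed to look like md5sum output (checksum, two spaces, file name).

```shell
# Hypothetical helper: reads md5sum-style lines on stdin and prints the
# file name of every line whose checksum already appeared on an earlier line.
list_extra_copies() {
    # seen[$1]++ is 0 (false) the first time a checksum occurs and
    # non-zero afterwards, so only later occurrences are printed.
    awk 'seen[$1]++' | cut -d ' ' -f 3-
}

# Made-up checksums for illustration only.
printf '%s\n' \
    'aaaa  prime-0.png' \
    'bbbb  prime-1.png' \
    'aaaa  prime-2.png' \
    'aaaa  prime-3.png' |
    list_extra_copies
```

This prints prime-2.png and prime-3.png: the first file of each checksum is kept, and only the later copies are listed.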
Upvotes: 5
Reputation: 50805
Using an associative array, which requires bash 4.0 or later:
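declare -A a
for f in prime-*; do
    f_md5=$(md5sum <"$f" | cut -c-32)
    # ${a[...]- } is a space for an unseen checksum, empty for a seen one
    if [ -n "${a[$f_md5]- }" ]; then
        a[$f_md5]=    # first file with this checksum: remember it, keep the file
    else
        rm -- "$f"    # checksum already seen: delete the extra copy
    fi
done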
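The test in the loop relies on the difference between an unset variable and one that is set but empty. A minimal sketch of that expansion with a plain variable; the names marker, first and second are illustrative only:

```shell
# ${marker- } (no colon) substitutes a space only when marker is unset;
# a variable that is set but empty expands to the empty string.
unset marker
if [ -n "${marker- }" ]; then first=new; else first=seen; fi

marker=
if [ -n "${marker- }" ]; then second=new; else second=seen; fi

echo "$first $second"   # prints "new seen"
```

This is why assigning an empty value to the array element is enough to mark a checksum as seen.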
Upvotes: 3
Reputation: 141920
Compare no more than the first 32 characters (the md5sum) when searching for duplicates with uniq -w32 -D. Then subtract the list that holds one file per group of duplicates, produced with uniq -w32 -d.
# the input file
# files 102-105 are to be removed
cat <<EOF |
ed8c994d608ba2fde59e6a08c65bcc1f  prime-0.png
e7667b2c92359d23cd1cd251e54b41ba  prime-100.png
0afc9b57206cde58ff609a6476bde7a6  prime-101.png
0afc9b57206cde58ff609a6476bde7a6  prime-102.png
0afc9b57206cde58ff609a6476bde7a6  prime-103.png
0afc9b57206cde58ff609a6476bde7a6  prime-104.png
e7667b2c92359d23cd1cd251e54b41ba  prime-105.png
EOF
# sort by md5sum
# save to a temporary file
sort -t' ' -k1 > tmp1
# the first uniq prints every line that belongs to a group of duplicates
# the second uniq prints only one line per group of duplicates
# comm -23 then keeps the lines that are in the first stream but not in the second
comm -23 <(uniq -w32 -D tmp1) <(uniq -w32 -d tmp1) |
# extract the filename
cut -d' ' -f3
will output:
prime-102.png
prime-103.png
prime-104.png
prime-105.png
Live version at repl.
The magic constant 32 is the number of characters in an md5sum, i.e. the output of echo -n '0afc9b57206cde58ff609a6476bde7a6' | wc -c.
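The width does not have to be hard-coded. A minimal sketch, assuming GNU md5sum is available; the variable names sum and width are illustrative and not from the answer above:

```shell
# Hash empty input and keep only the checksum field.
sum=$(printf '' | md5sum | cut -d ' ' -f 1)

# ${#sum} is the number of characters in the checksum.
width=${#sum}
echo "$width"   # prints 32
```

The result could then be passed on as uniq -w"$width", which would keep the pipeline working if md5sum were swapped for a tool with longer checksums, such as sha256sum.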
Upvotes: 1