Simd

Reputation: 21353

How to remove duplicate files in Linux

I have 500 files in a directory, named prime-0.png to prime-499.png, and some of them are duplicates. I can tell they are duplicates by running, for example, md5sum on them.

How can I delete the duplicate files so I am left with only one copy of each?

md5sum prime-* 

gives me:

ed8c994d608ba2fde59e6a08c65bcc1f  prime-0.png
e7667b2c92359d23cd1cd251e54b41ba  prime-100.png
0afc9b57206cde58ff609a6476bde7a6  prime-101.png
[...]

I can see how many are duplicated by doing:

md5sum prime-* |cut -f1 -d\ |sort|uniq -c|sort -rn|less

which gives me:

5 f905fde6abfbcbb00e079dcd4ecacbb7
3 efcdd042802fc0efc6d9fdf164df4e20
3 ed5a46d250c85809b57ee96385f655d2
3 c4cff53df13b87381b2c06538c339790
[...]

Upvotes: 0

Views: 1234

Answers (3)

Cyrus

Reputation: 88959

This answer is only suitable for file names without line breaks.

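awk prints every line whose checksum (first column) has already been seen, so the first file of each group is kept and only the extra copies are listed: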

md5sum prime-* | awk 'n[$1]++' | cut -d " " -f 3- | xargs -I {} echo rm {}

If the output looks fine, remove echo.


From man xargs:

-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character.
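To try the pipeline above without touching real data, here is a minimal sketch that builds a scratch directory and prints the rm commands as a dry run. The scratch directory, the four sample files, and their contents are made up for illustration; it assumes GNU md5sum and file names without line breaks.

```shell
#!/bin/sh
# Dry run of the awk-based pipeline on throwaway files.
# The directory and sample files are hypothetical, for illustration only.
dir=$(mktemp -d)
printf 'A' > "$dir/prime-0.png"
printf 'B' > "$dir/prime-1.png"
printf 'A' > "$dir/prime-2.png"   # same content as prime-0.png
printf 'A' > "$dir/prime-3.png"   # same content as prime-0.png

# awk prints a line only when its checksum was seen before,
# so the first file of each group is never listed.
dupes=$(cd "$dir" && md5sum prime-* | awk 'n[$1]++' | cut -d ' ' -f 3-)

# Show what would be removed; nothing is deleted here.
printf '%s\n' "$dupes" | xargs -I {} echo rm {}
```

This should print rm prime-2.png and rm prime-3.png while leaving all four files in place. The scratch directory can be removed afterwards with rm -r "$dir".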

Upvotes: 5

oguz ismail

Reputation: 50805

Using an associative array, which requires bash 4.0 or later:

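declare -A a   # md5 checksum -> set (to an empty value) once seen

for f in prime-*; do
  f_md5=$(md5sum <"$f" | cut -c-32)
  # ${a[...]- } expands to a space while the key is unset and to the
  # stored empty value afterwards, so -n is true only on the first sighting
  if [ -n "${a[$f_md5]- }" ]; then
    a[$f_md5]=
  else
    rm -- "$f"
  fi
done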
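Since the loop above deletes files right away, a dry-run variant can be useful for checking the result first. The following is only a sketch under assumptions: list_dupes is a hypothetical helper name, the sample files are made up, and it uses an explicit "is this key set" test instead of the empty-value trick.

```shell
#!/usr/bin/env bash
# Dry-run sketch: list the files the loop would delete instead of deleting them.
# list_dupes is a hypothetical helper name, not part of the original answer.
list_dupes() {
  local -A seen=()          # checksum -> 1 once a file with that checksum was kept
  local f sum
  for f in "$@"; do
    sum=$(md5sum <"$f" | cut -c-32)
    if [ -n "${seen[$sum]+x}" ]; then
      printf '%s\n' "$f"    # checksum already seen: this file is a duplicate
    else
      seen[$sum]=1          # first file with this checksum: keep it
    fi
  done
}

# Scratch data, made up for illustration.
dir=$(mktemp -d)
printf 'A' > "$dir/prime-0.png"
printf 'B' > "$dir/prime-1.png"
printf 'A' > "$dir/prime-2.png"   # same content as prime-0.png

dupes=$(cd "$dir" && list_dupes prime-*)
printf '%s\n' "$dupes"
```

Once the printed list looks right, the same names can be passed to rm --, one per line.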

Upvotes: 3

KamilCuk

Reputation: 141920

Compare no more than the first 32 characters of each line (the md5sum) when searching for duplicates with uniq -w32 -D. Then use comm to subtract the list that holds one file per duplicate group, which uniq -w32 -d produces.

# the input file
# files 102-105 are to be removed
cat <<EOF |
ed8c994d608ba2fde59e6a08c65bcc1f  prime-0.png
e7667b2c92359d23cd1cd251e54b41ba  prime-100.png
0afc9b57206cde58ff609a6476bde7a6  prime-101.png
0afc9b57206cde58ff609a6476bde7a6  prime-102.png
0afc9b57206cde58ff609a6476bde7a6  prime-103.png
0afc9b57206cde58ff609a6476bde7a6  prime-104.png
e7667b2c92359d23cd1cd251e54b41ba  prime-105.png
EOF
# sort with md5sums
# save to temporary file
sort -t' ' -k1 > tmp1

# we print all duplicates with the first uniq
# and print only one line per duplicate group with the second uniq
# then we find the lines in the first stream that are not in the second
comm -23 <(uniq -w32 -D tmp1) <(uniq -w32 -d tmp1) |
# extract the filename
cut -d' ' -f3

will output:

prime-102.png
prime-103.png
prime-104.png
prime-105.png

Live version at repl.

The magic constant 32 is the number of characters in an md5sum hash, i.e. the output of echo -n '0afc9b57206cde58ff609a6476bde7a6' | wc -c.
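The example above feeds a fixed list through a here-document. As a rough sketch of how the same idea might be applied to real md5sum output, the following uses temporary files instead of process substitution so that it also runs in plain sh. The scratch directory and sample files are made up for illustration, and it assumes GNU uniq (-w and -D are GNU extensions) and file names without line breaks.

```shell
#!/bin/sh
# Sketch: the same uniq/comm idea applied to md5sum output of real files.
# The scratch directory and sample files are hypothetical.
dir=$(mktemp -d)
printf 'A' > "$dir/prime-0.png"
printf 'B' > "$dir/prime-1.png"
printf 'A' > "$dir/prime-2.png"   # same content as prime-0.png
printf 'A' > "$dir/prime-3.png"   # same content as prime-0.png

sums=$(mktemp)   # sorted checksum list
all=$(mktemp)    # every member of a duplicate group
keep=$(mktemp)   # first member of each duplicate group

# Sort in the C locale so that comm sees the ordering it expects.
(cd "$dir" && md5sum prime-*) | LC_ALL=C sort > "$sums"
uniq -w32 -D "$sums" > "$all"
uniq -w32 -d "$sums" > "$keep"

# Lines only in "$all" are the extra copies; print their names as a dry run.
to_remove=$(LC_ALL=C comm -23 "$all" "$keep" | cut -d ' ' -f 3-)
printf '%s\n' "$to_remove"

rm -f "$sums" "$all" "$keep"
```

Nothing in the scratch directory is deleted; the printed names are the candidates for removal.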

Upvotes: 1
