Reputation: 10379
I have about 20 files in a directory, and some of those files are duplicates. Since they have different names, how can I identify which ones are duplicates so that I can delete them?
After doing some research I found that md5 or cksum tools can be used, but I can't seem to make everything work.
Upvotes: 2
Views: 307
Reputation: 77185
You can identify duplicate files using an awk one-liner.
Let's create some files, some of which will be duplicates.
[jaypal~/Temp]$ cat a.txt
jaypal
[jaypal~/Temp]$ cat b.txt
singh
[jaypal~/Temp]$ cat c.txt
jaypal
[jaypal~/Temp]$ cat d.txt
ayaplj
From the output shown above we know that files a.txt and c.txt are exact duplicates. File d.txt, even though it has my name re-arranged, cannot be categorized as a duplicate.
We will use the cksum utility on each file and capture the output in a separate file.
[jaypal~/Temp]$ cksum a.txt b.txt c.txt d.txt > cksum.txt
[jaypal~/Temp]$ cat cksum.txt
3007025847 7 a.txt
1281385283 6 b.txt
3007025847 7 c.txt
750690976 7 d.txt
Note: I used the above method since there were only 4 files for this demo. If you have hundreds of files to check for duplicates, use a simple for loop instead.
[jaypal~/Temp]$ for i in ./*.txt; do cksum "$i" >> cksum1.txt; done
[jaypal~/Temp]$ cat cksum1.txt
3007025847 7 ./a.txt
1281385283 6 ./b.txt
3007025847 7 ./c.txt
750690976 7 ./d.txt
Now that we have the cksum.txt file, we can use it with our awk one-liner to identify duplicates.
[jaypal~/Temp]$ awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
3007025847 7 a.txt
3007025847 7 c.txt
This will list all the files that have more than one copy in your directory. Please note that you should delete only one of these files, not both. :) You can always pipe the output to sort to get them in order.
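For example, something like this (the same one-liner, just piped through a plain sort, which groups the entries by checksum since that is the first field) would do:
awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt | sort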
Alternatively, you can do the following to get just the duplicate file instead of both copies. The reason I am not too fond of this one is that it doesn't show me which file it is a duplicate of.
[jaypal~/Temp]$ awk '{ x[$1]++; if (x[$1]>1) print $0}' cksum.txt
3007025847 7 c.txt
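If you do want to see which file each entry duplicates, a small variation along these lines (a sketch, not part of the one-liner above: it remembers the first filename seen for each checksum) should work:
awk 'x[$1]++ { print $3, "is a duplicate of", first[$1]; next } { first[$1] = $3 }' cksum.txt
With the cksum.txt above, this reports c.txt as a duplicate of a.txt.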
Upvotes: 1
Reputation: 8637
First, put all the checksums, along with the files they come from, into a temp file:
cksum * > /tmp/blah
Then sort the file and filter it with uniq on the first 10 characters (the checksum itself), keeping only the duplicates:
sort /tmp/blah | uniq -w 10 -d > /tmp/blah.dups
Then delete those dups:
cut -d" " -f3 /tmp/blah.dups | xargs rm
Upvotes: 1
Reputation: 70173
You can use the sum command to generate the checksum for a file like so: sum FILENAME. If two files have the same checksum, it is exceedingly likely (although, depending on the checksum algorithm, not 100% guaranteed) that they are identical.
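For example (a minimal sketch, with placeholder filenames), run it over two files and compare the first column of the output:
sum file1.txt file2.txt
If the checksums in the first column match, the files are almost certainly identical.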
Upvotes: 0