Reputation: 10379
I have about 20 files in a directory, and some of those files are duplicates. Since they have different names, how can I identify which ones are duplicates so that I can delete them?
After doing some research I found that md5 or cksum tools can be used, but I can't seem to make everything work.
Upvotes: 2
Views: 307
Reputation: 77185
You can identify duplicate files using an awk one-liner.
Let's create some files, some of which will be duplicates.
[jaypal~/Temp]$ cat a.txt
jaypal
[jaypal~/Temp]$ cat b.txt
singh
[jaypal~/Temp]$ cat c.txt
jaypal
[jaypal~/Temp]$ cat d.txt
ayaplj
From the output shown above we know that files a.txt and c.txt are exact duplicates. File d.txt, even though it has my name re-arranged, cannot be categorized as a duplicate.
We will use the cksum utility on each file and capture the output in a separate file.
[jaypal~/Temp]$ cksum a.txt b.txt c.txt d.txt > cksum.txt
[jaypal~/Temp]$ cat cksum.txt
3007025847 7 a.txt
1281385283 6 b.txt
3007025847 7 c.txt
750690976 7 d.txt
Note: I used the above method since there were only 4 files for this demo. If you have hundreds of files to check for duplicates, use a simple for loop instead.
[jaypal~/Temp]$ for i in ./*.txt; do cksum "$i" >> cksum1.txt; done
[jaypal~/Temp]$ cat cksum1.txt
3007025847 7 ./a.txt
1281385283 6 ./b.txt
3007025847 7 ./c.txt
750690976 7 ./d.txt
Now that we have the cksum.txt file, we can use it with our awk one-liner to identify duplicates.
[jaypal~/Temp]$ awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
3007025847 7 a.txt
3007025847 7 c.txt
This will list all the files that have more than one copy in your directory. Please note that you should delete only one of these files, not both. :) You can always pipe the output to sort to get them in order.
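For example, something like this (the same one-liner, just piped through a plain sort, which groups the entries by checksum since that is the first field) would do:
awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt | sort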
Alternatively, you can do the following to get just the duplicate file instead of both copies. The reason I am not too fond of this one is that it doesn't show me which file it is a duplicate of.
[jaypal~/Temp]$ awk '{ x[$1]++; if (x[$1]>1) print $0}' cksum.txt
3007025847 7 c.txt
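If you do want to see which file each entry duplicates, a small variation along these lines (a sketch, not part of the one-liner above: it remembers the first filename seen for each checksum) should work:
awk 'x[$1]++ { print $3, "is a duplicate of", first[$1]; next } { first[$1] = $3 }' cksum.txt
With the cksum.txt above, this reports c.txt as a duplicate of a.txt.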
Upvotes: 1
Reputation: 8637
First, put all the checksums, along with the files they come from, into a temp file:
cksum * > /tmp/blah
Then sort the file and filter it with uniq on the first 10 characters (the checksum itself), keeping only the duplicates:
sort /tmp/blah | uniq -w 10 -d > /tmp/blah.dups
Then delete those dups:
cut -d" " -f3 /tmp/blah.dups | xargs rm
Upvotes: 1
Reputation: 70173
You can use the sum command to generate the checksum for a file like so: sum FILENAME. If two files have the same checksum, it is exceedingly likely (although, depending on the checksum algorithm, not 100% guaranteed) that they are identical.
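For example (a minimal sketch, with placeholder filenames), run it over two files and compare the first column of the output:
sum file1.txt file2.txt
If the checksums in the first column match, the files are almost certainly identical.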
Upvotes: 0