Graph4Me Consultant

Reputation: 353

Find common files between two folders

Given two root folders A and B, how can I find duplicate text files between the subfolders of A and of B?

In other words, I am looking for the intersection of the files in A and B.

I don't want to find duplicate files within A or within B, but only files that are in both A and B.

Edit: by "duplicate" I mean files with the same content.

Upvotes: 7

Views: 3925

Answers (2)

P....

Reputation: 18351

comm -1 -2 <(ls dir1 | sort) <(ls dir2 | sort)

This prints only the lines common to both directory listings, i.e. it matches files by name. For example:

ls -1 dir1
f1
f2
f3

ls -1 dir2
f1
f4
f5

comm -1 -2 <(ls dir1 | sort) <(ls dir2 | sort)
f1
# If your shell is not bash (process substitution is a bash feature), run it via bash:
bash -c 'comm -1 -2 <(ls dir1 | sort) <(ls dir2 | sort)'
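
Alternatively, a minimal POSIX sketch that avoids process substitution altogether, using temporary list files (the /tmp paths are illustrative):

ls dir1 | sort > /tmp/d1.list
ls dir2 | sort > /tmp/d2.list
comm -1 -2 /tmp/d1.list /tmp/d2.list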

Or using find. With -F'/', awk splits each find -ls line on the slashes, so $2 is the basename (this assumes the files sit directly under dir1 and dir2); N[$2]++ becomes true on the second occurrence of a name, and that duplicate is printed:

find dir1 dir2 -type f -ls | awk -F'/' 'N[$2]++ {print $NF}'
f1

Or, to print the full path, extract the path field first and apply the same counting:

find dir1 dir2 -type f -ls | awk '{print $NF}' | awk -F'/' 'N[$2]++'
dir2/f1
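
Since the question mentions subfolders, here is a minimal sketch for nested trees; it assumes GNU find (whose -printf '%f\n' prints just the basename) and still compares names only:

comm -1 -2 <(find dir1 -type f -printf '%f\n' | sort -u) <(find dir2 -type f -printf '%f\n' | sort -u)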

The approaches above compare names only. For finding duplicates in terms of content:

files1=(dir1/*)
files2=(dir2/*)

for item1 in "${files1[@]}"
do
   # CRC checksum of the first file
   ck1=$(cksum "$item1" | awk '{print $1}')
   for item2 in "${files2[@]}"
   do
      ck2=$(cksum "$item2" | awk '{print $1}')
      if [ "$ck1" == "$ck2" ]; then
         echo "Duplicate entry found for $item1 and $item2"
      fi
   done
done
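
Checksumming every pair recomputes each cksum many times. A sketch of a faster variant, assuming bash 4+ for associative arrays (same directory names as above):

declare -A seen
# checksum each file in dir2 once and index it by checksum
for f in dir2/*; do
    ck=$(cksum "$f" | awk '{print $1}')
    seen[$ck]+="$f "          # several files may share a checksum
done
# look each dir1 checksum up in the table
for f in dir1/*; do
    ck=$(cksum "$f" | awk '{print $1}')
    if [ -n "${seen[$ck]:-}" ]; then
        echo "Duplicate entry found for $f and ${seen[$ck]% }"
    fi
done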

Upvotes: 7

Mark Setchell

Reputation: 207425

As indicated in the comments section, I would generate a single MD5 checksum for each file, just once, and then look for duplicated checksums.

Something like this:

find DirA -name \*.txt -exec md5sum {} +  > /tmp/a
find DirB -name \*.txt -exec md5sum {} +  > /tmp/b
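
For illustration (the file names here are invented; the hash shown is the real MD5 of empty input, so imagine both files are empty and therefore identical), the two lists might look like:

cat /tmp/a
d41d8cd98f00b204e9800998ecf8427e  DirA/notes.txt

cat /tmp/b
d41d8cd98f00b204e9800998ecf8427e  DirB/backup.txt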

Now find all those checksums that occur in both files.

So, along these lines:

awk 'FNR==NR{md5[$1];next}$1 in md5' /tmp/[ab]
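
This reads /tmp/a first, remembering its checksums, then prints every line of /tmp/b whose checksum was seen. With the hypothetical lists above, that prints:

d41d8cd98f00b204e9800998ecf8427e  DirB/backup.txt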

or maybe like this:

awk 'FNR==NR{s=$1;md5[s];$1="";name[s]=$0;next}$1 in md5{s=$1;$1="";print name[s] " : " $0}' /tmp/[ab]
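
For the same hypothetical lists, this pairs the path from /tmp/a with the matching path from /tmp/b, printing (modulo awk's field-join whitespace):

DirA/notes.txt : DirB/backup.txt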

Upvotes: 4
