bachN

Reputation: 612

Bash: merge two directories and delete duplicated data

I want to compare the contents of two folders and delete duplicated data. I already wrote a Bash script, but I don't think it's the right way to do it: I use loops to iterate over the directory contents and a lot of diff commands, which makes it too time-consuming.

I'll explain the context:

I have two directories:

1-

  dir1/ 
       Student1/
                homework1 
                homework2 

       Student2/
                homework1
                homework2

2-

  dir2/ 
       Student1/
                homework1
                homework2 

       Student3/
                homework1
                homework2

Suppose that the Student1/homework1 folder contains the same data in dir1 and dir2, unlike homework2, which contains different data.

The output directory should contain:

       Student1
              homework1                 //same name , same content ==> keep one homework
              homework2
              homework2_dir2                //same name different content ==> _dir2

       Student2
              homework1 
              homework2 

       Student3
              homework1
              homework2

What do you think is the optimal way, in terms of time and reliability (filename problems, etc.), to do this kind of operation?

Thank you ;)

PS: dir*, Student* and homework* are directories.

PS2: Please, I am not looking for this kind of answer:

loop over student 
  loop over student homeworks
      test on homework existence
      diff on homework content
        if diff copy
  end

end

If I have a lot of students and a lot of homework directories with only one difference (only one homework that differs), the script takes a lot of time with the above solution.

Upvotes: 1

Views: 2047

Answers (3)

ramazan polat

Reputation: 7910

As far as I understand, you need to merge all files in two different directories into a new directory and you don't want duplicate files or folders.

Let's say you want to merge them into a 'merged' directory.

You can do this:

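rsync -hrv /dir1/ /merged/
rsync -hrv /dir2/ /merged/

All files in the /dir1 folder are copied into the /merged folder (the trailing slash on the source makes rsync copy the directory's contents rather than the directory itself), then the same happens for the /dir2 folder. Note that a file from /dir2 will generally replace a same-named file already in /merged, so this does not keep a separate _dir2 copy when the contents differ.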

Upvotes: 0

JezC

Reputation: 1868

Assuming that dir1 and dir2 are relative paths with no directories (i.e. no slashes in dir1 or dir2):

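dir1=dir1
dir2=dir2
CMPDIR=$(cd "$dir2" && pwd)   # absolute path of the tree to compare against
cd "$dir1" || exit 1
BASEDIR=$(pwd)
for studentdir in *
do
  cd "$BASEDIR/$studentdir" || continue
  for homeworkdir in *
  do
    cd "$BASEDIR/$studentdir/$homeworkdir" || continue
    for workfile in *
    do
      other=$CMPDIR/$studentdir/$homeworkdir/$workfile
      # cmp -s prints nothing and exits non-zero when the files differ;
      # the -f test skips files that have no counterpart in dir2
      if [ -f "$other" ] && ! cmp -s "$workfile" "$other"
      then
        altdir=../${homeworkdir}_${dir2}
        mkdir -p "$altdir"
        ln "$other" "$altdir"
      fi
    done
  done
done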

I haven't tried this - there may be some typos.

In dir1, recurse into each student folder, and in each student folder into each homework directory.

In each homework directory, use cmp on each file to check whether it is byte identical with the matching file in the dir2 subtree.

If different, create an alternate homework directory in the student directory, and link (ln) the differing file into the alternate directory.

cmp is faster than diff; ln is faster than cp.
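
As a quick check of the comparison step, here is a minimal sketch. cmp -s prints nothing and reports the result only through its exit status (0 when the files are identical, 1 when they differ, greater than 1 on an error such as a missing file). The helper name files_differ and the sample file names are made up for illustration.

```shell
#!/bin/bash
# files_differ A B (hypothetical helper): succeeds only when both files
# exist and their bytes differ.
files_differ() {
  [ -f "$1" ] && [ -f "$2" ] && ! cmp -s "$1" "$2"
}

# Throwaway demo files (names are illustrative only).
tmp=$(mktemp -d)
printf 'answer A\n' > "$tmp/a.txt"
printf 'answer A\n' > "$tmp/same.txt"
printf 'answer B\n' > "$tmp/other.txt"

files_differ "$tmp/a.txt" "$tmp/same.txt"  || echo "a.txt and same.txt are identical"
files_differ "$tmp/a.txt" "$tmp/other.txt" && echo "a.txt and other.txt differ"

rm -rf "$tmp"
```

Because cmp stops at the first differing byte, it does not need to read both files to the end when they differ early.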

That's all, folks.

Upvotes: 1

choroba

Reputation: 242423

I'm not sure it's faster than your solution, as you didn't post it.

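#!/bin/bash

mkdir output
cp -r dir1/* output

cd dir2 || exit 1
for student in Student* ; do
    (
        cd "$student" || exit 1
        out_path=../../output/$student
        [[ -d $out_path ]] || mkdir "$out_path"
        # Each entry is a homework directory, hence -e, diff -r and cp -r.
        for file in * ; do
            if [[ -e $out_path/$file ]] ; then
                # Same name in both trees: keep a _dir2 copy only if the content differs.
                diff -qr "$file" "$out_path/$file" \
                    || cp -r "$file" "$out_path/${file}_dir2"
            else
                cp -r "$file" "$out_path/$file"
            fi
        done
    )
done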
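
Since the question states that the homework entries are directories, here is a hedged sketch of the same idea at directory level, using diff -qr and cp -r. It wraps the logic in a function and runs it on a throwaway copy of the layout from the question. The function name merge_dirs, the file name answer.txt and the temporary paths are assumptions made for illustration; they are not part of the script above.

```shell
#!/bin/bash
# merge_dirs DIR1 DIR2 OUT (hypothetical helper)
# Copies DIR1 into OUT, then adds the entries of DIR2:
#   - missing in OUT           -> copied as is
#   - present, same content    -> skipped
#   - present, content differs -> copied as NAME_dir2
merge_dirs() {
  local dir1=$1 dir2=$2 out=$3 student_path student hw_path hw
  mkdir -p "$out"
  cp -r "$dir1"/. "$out"
  for student_path in "$dir2"/*/ ; do
    [ -d "$student_path" ] || continue      # the glob matched nothing
    student=$(basename "$student_path")
    mkdir -p "$out/$student"
    for hw_path in "$student_path"* ; do
      [ -e "$hw_path" ] || continue
      hw=$(basename "$hw_path")
      if [ ! -e "$out/$student/$hw" ]; then
        cp -r "$hw_path" "$out/$student/$hw"
      elif ! diff -qr "$hw_path" "$out/$student/$hw" > /dev/null; then
        cp -r "$hw_path" "$out/$student/${hw}_dir2"
      fi
    done
  done
}

# Build the sample layout from the question in a scratch directory.
work=$(mktemp -d)
for d in dir1/Student1 dir1/Student2 dir2/Student1 dir2/Student3; do
  for hw in homework1 homework2; do
    mkdir -p "$work/$d/$hw"
    echo "shared content" > "$work/$d/$hw/answer.txt"
  done
done
# Make Student1/homework2 differ between the two trees.
echo "changed in dir2" > "$work/dir2/Student1/homework2/answer.txt"

merge_dirs "$work/dir1" "$work/dir2" "$work/output"
(cd "$work/output" && ls -d */*)   # list Student/homework entries
rm -rf "$work"
```

For the sample layout, only Student1 should end up with a homework2_dir2 entry. The sketch still loops over students and homework directories, but it runs one recursive diff per homework directory instead of one diff per file.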

Upvotes: 0
