Philip Kirchhoff

Reputation: 27

Create archive from difference of two folders

I have two directory trees, A and B, which are two mounted rootfs images. They are mostly identical, but B has a few files that A does not. I want to create a shell script that does the following:

  1. Find out which files are contained in B but not in A.
  2. Copy the files found in step 1 from B into a tar.gz archive, keeping the folder structure.

The goal is to import the additional data from image B later on an embedded system that already holds the contents of image A.

For the first step I put together the following code snippet. A note on grep "Nur": my locale is German, so diff prints "Nur in" where the English output says "Only in":

diff -rq <A> <B>/ 2>/dev/null | grep Nur | awk '{print substr($3, 1, length($3)-1) "/" substr($4, 1, length($4)-1)}' 

The result is a list of paths relative to folder B.

I have no idea how to implement the second step. Can someone give me some help?

Upvotes: 1

Views: 128

Answers (2)

tripleee

Reputation: 189387

Using diff for finding files which don't exist is severe overkill; you are doing a lot of calculations to compare the contents of the files, where clearly all you care about is whether a file name exists or not.

Maybe try this instead.

tar zcf newfiles.tar.gz $(comm -13 <(cd A && find . -type f | sort) <(cd B && find . -type f | sort) | sed 's/^\./B/')

The find commands produce a listing of the file name hierarchies; comm -13 extracts the elements which are unique to the second input file (which here isn't really a file at all; we are using the shell's process substitution facility to provide the input) and the sed command adds the path into B back to the beginning.
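To make the role of comm -13 concrete, here is a minimal sketch on two hand-written, already sorted lists. The file names a.list and b.list and the scratch directory are made up for the illustration:

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)     # hypothetical scratch directory for the demo
cd "$work"

# Two sorted listings, standing in for the output of the two find commands.
printf '%s\n' ./a ./b > a.list
printf '%s\n' ./a ./b ./c ./d > b.list

# -1 drops lines unique to the first file, -3 drops lines common to both,
# so only the lines unique to the second file remain.
unique=$(comm -13 a.list b.list)
printf '%s\n' "$unique"
```

This prints ./c and ./d, the two entries that appear only in the second list.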

Passing a command substitution $(...) as the argument to tar is problematic: if there are a lot of file names, you will run into "command line too long", and if your file names contain whitespace or other irregularities, the shell will mess them up. The standard solution is xargs, but xargs tar cf will overwrite the output file if xargs ends up calling tar more than once. A better route is a tar that can read the file names from standard input; GNU tar, for one, does this with --files-from=- (short form -T -).
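A minimal sketch of the stdin variant, assuming GNU tar and file names without embedded newlines. The directory layout and the names a.list and newfiles.tar.gz are only for the demonstration; the list for A goes into a temporary file so the script does not depend on process substitution:

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)     # hypothetical scratch directory for the demo
cd "$work"
mkdir -p A/etc B/etc B/opt
touch A/etc/common B/etc/common B/etc/extra "B/opt/with space"

# Sorted list of the files in A, relative to A (comm needs sorted input).
(cd A && find . -type f | LC_ALL=C sort) > a.list

# Names that exist only in B are streamed into tar, one per line.
# -C B makes tar look the names up inside B; -T - reads them from stdin.
(cd B && find . -type f | LC_ALL=C sort) \
  | LC_ALL=C comm -13 a.list - \
  | tar -czf newfiles.tar.gz -C B -T -

listing=$(tar -tzf newfiles.tar.gz)
printf '%s\n' "$listing"
```

Because tar is invoked only once, there is no argument-length limit and no risk of the archive being overwritten, and the name containing a space arrives intact.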

Upvotes: 2

Renaud Pacalet

Reputation: 29040

With find:

$ mkdir -p A B
$ touch A/a A/b
$ touch B/a B/b B/c B/d
$ cd B
$ find . -type f -exec sh -c '[ ! -f ../A/"$1" ]' _ {} \; -print
./c
./d

The idea is to use the -exec action with a shell script that tests whether the current file exists in the other directory. There are a few subtleties:

  • The first argument of sh -c is the script to execute, the second (here _ but could be anything else) corresponds to the $0 positional parameter of the script and the third ({}) is the current file name as set by find and passed to the script as positional parameter $1.
  • The -print action at the end is needed, even if it is normally the default with find, because the use of -exec cancels this default.
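The way sh -c assigns its arguments to $0 and $1 can be checked in isolation. A small sketch, with ./c standing in for the name that find would substitute for {}:

```shell
#!/bin/sh
set -eu

# The script text comes first, then "_" becomes $0 and "./c" becomes $1.
out=$(sh -c 'printf "0=%s 1=%s\n" "$0" "$1"' _ ./c)
printf '%s\n' "$out"
```

This prints 0=_ 1=./c, which is why the test inside the script refers to "$1".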

Example of use to generate your tarball with GNU tar:

$ cd B
$ find . -type f -exec sh -c '[ ! -f ../A/"$1" ]' _ {} \; -print > ../list.txt
$ tar -c -v -f ../diff.tar --files-from=../list.txt
./c
./d

Note: if you have unusual file names, the GNU tar option --verbatim-files-from can help, as can combining the -print0 action of find with the --null option of GNU tar.
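A sketch of the -print0 / --null combination, assuming GNU find and GNU tar. The scratch directory, the names list.bin and diff.tar.gz, and the deliberately awkward file name are made up for the illustration:

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)                 # hypothetical scratch directory
mkdir -p "$work/A" "$work/B/sub"
odd=$(printf 'odd\nname')         # a file name containing a newline
touch "$work/A/a" "$work/B/a" "$work/B/sub/c" "$work/B/$odd"

cd "$work/B"
# Same existence test as before, but each name is terminated by NUL instead of newline.
find . -type f -exec sh -c '[ ! -f ../A/"$1" ]' _ {} \; -print0 > ../list.bin

# --null must precede --files-from so that tar splits the list on NUL bytes.
tar -c -z -f ../diff.tar.gz --null --files-from=../list.bin

# Unpack somewhere else to check what ended up in the archive.
mkdir "$work/out"
tar -x -z -f ../diff.tar.gz -C "$work/out"
ls "$work/out"
```

A newline-delimited list would split the odd name into two bogus entries; the NUL-delimited list keeps it as one. The -z option is added here only because the question asks for a tar.gz.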

Note: if your sh is POSIX-compliant (e.g., bash), you can also run find from the parent directory and get the file paths relative to it, if you prefer:

$ mkdir -p A B
$ touch A/a A/b
$ touch B/a B/b B/c B/d
$ find B -type f -exec sh -c '[ ! -f A"${1#B}" ]' _ {} \; -print
B/c
B/d
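The expression "${1#B}" is the POSIX "remove shortest matching prefix" expansion. A quick sketch of what it does to a path, with B/sub/c as a made-up example:

```shell
#!/bin/sh
set -eu

# ${1#B} strips the leading "B" from the path; prefixing "A" then gives
# the location where the same file would live in the other tree.
out=$(sh -c 'printf "%s\n" "A${1#B}"' _ B/sub/c)
printf '%s\n' "$out"
```

This prints A/sub/c, the path that the [ ! -f ... ] test checks for each file found under B.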

Upvotes: 2
