anonymouse

Reputation: 125

Bash: Check all files in a location against another for existence

I'm after a little help with some Bash scripting (on OSX). I want to create a script that takes two parameters - source folder and target folder - and checks all files in the source hierarchy to see whether or not they exist in the target hierarchy. i.e. Given a data DVD check whether the files contained on it are already on the internal drive.

What I've come up with so far is

#!/bin/bash

if [ $# -ne 2 ]
then
        echo "Usage is command sourcedir targetdir"
        exit 0
fi

source="$1"
target="$2"

for f in "$( find $source -type f -name '*' -print )"
do

I'm now not sure how it's best to obtain the filename without its path and then see if it exists. I am really a beginner at scripting.

Edit: The answers given so far are all very efficient in terms of compact code. However, I need to be able to look for each file found anywhere in the source hierarchy at any location within the target hierarchy. If a file is found, I would like to compare checksums, last-modified dates, etc. and report on them; if it is not found, I would like to note that. The purpose is to check whether files on external media have been uploaded to a file server.

Upvotes: 1

Views: 431

Answers (3)

A few remarks about the line for f in "$( find $source -type f -name '*' -print )":

  • Make that "$source". Always use double quotes around variable substitutions. Otherwise the result is split into words that are treated as wildcard patterns (a historical oddity in the shell parsing rules); in particular, this would fail if the value of the variable contains spaces.
  • You can't iterate over the output of find that way. Because of the double quotes, there would be a single iteration through the loop, with $f containing the complete output from find. Without the double quotes, file names containing spaces and other special characters would trip up the script. One safer pattern is sketched just after this list.
  • -name '*' is a no-op; it matches everything.

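A minimal sketch of a safer pattern (this assumes the $source variable from the question; find -print0 and read -d '' are both available in the stock OS X find and bash):

find "$source" -type f -print0 |
while IFS= read -r -d '' f
do
    name="${f##*/}"              # file name without its leading path
    printf 'checking: %s\n' "$name"
done
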
As far as I understand, you want to look for files by name independently of their location, i.e. you consider /dvd/path/to/somefile to be a match for /internal-drive/different/path-to/somefile. So make a list of files on each side indexed by name. You can do this by massaging the output of find a little. The code below can cope with any character in file names except newlines.

list_files () {
  # Print each file as "name/path" so the file name becomes a join key,
  # then sort on that key for join below.
  find . -type f -print |
  sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:' |
  sort
}
source_files="$(cd "$1" && list_files)"
dest_files="$(cd "$2" && list_files)"
# -v 1 keeps lines whose key (the file name) appears only in the source list;
# the final sed strips the leading file name again.
join -t / -v 1 <(echo "$source_files") <(echo "$dest_files") |
sed 's:^[^/]*/::'

The list_files function generates a list of file paths and prepends each file's name to its path, so e.g. /mnt/dvd/some/dir/filename.txt will appear as filename.txt/./some/dir/filename.txt. It then sorts the list.
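
To see the transformation on a single path (a made-up example):

printf '%s\n' './some/dir/filename.txt' | sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:'
# prints: filename.txt/./some/dir/filename.txt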

The join command prints out lines like filename.txt/./some/dir/filename.txt when there is a file called filename.txt in the source hierarchy but not in the destination hierarchy. We finally massage its output a little since we no longer need the filename at the beginning of the line.
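
For illustration only, if the snippet above were saved as a script called missing-files.sh (a made-up name), it would be invoked with the two hierarchies as its arguments; the paths here are just placeholders:

./missing-files.sh /Volumes/DataDVD /Volumes/Server/Uploads
# one line per file name present under the first directory
# but absent (by name) from the second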

Upvotes: 0

Mark Reed

Reputation: 95252

To list only files in $source_dir that do not exist in $target_dir:

 comm -23 <(cd "$source_dir" && find .|sort) <(cd "$target_dir" && find .|sort)

You can limit it to just regular files by adding -type f to the find commands, etc.

The comm command (short for "common") finds lines in common between two text files and outputs three columns: lines only in the first file, lines only in the second file, and lines common to both. The numbers suppress the corresponding column, so the output of comm -23 is only the lines from the first file that don't appear in the second.
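
A tiny illustration of the column behaviour (the two input lists here are made up, and both must already be sorted):

comm <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')      # three tab-separated columns
comm -23 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')  # prints only: a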

The process substitution syntax <(command) is replaced by the pathname to a named pipe connected to the output of the given command, which lets you use a "pipe" anywhere you could put a filename, instead of only stdin and stdout.

The commands in this case generate lists of files under the two directories - the cd makes the output relative to the directories being compared, so that corresponding files come out as identical strings, and the sort ensures that comm won't be confused by the same files listed in different order in the two folders.
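
Putting that together, a rough sketch of the two-parameter script the question asks for might look like this (untested, with the -type f restriction mentioned above):

#!/bin/bash
# List files present under the source hierarchy but missing from the target.
if [ $# -ne 2 ]
then
    echo "Usage is command sourcedir targetdir"
    exit 1
fi
source_dir="$1"
target_dir="$2"
comm -23 <(cd "$source_dir" && find . -type f | sort) \
         <(cd "$target_dir" && find . -type f | sort)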

Upvotes: 1

jedwards

Reputation: 30210

This should give you some ideas:

#!/bin/bash

DIR1="tmpa"
DIR2="tmpb"

function sorted_contents
{
    # List files relative to the given directory, sorted so diff lines up.
    cd "$1" || exit 1
    find . -type f | sort
}

DIR1_CONTENTS=$(sorted_contents "$DIR1")
DIR2_CONTENTS=$(sorted_contents "$DIR2")

diff -y  <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")

In my test directories, the output was:

[user@host so]$ ./dirdiff.sh
./address-book.dat                             ./address-book.dat
./passwords.txt                                ./passwords.txt
./some-song.mp3                              <
./the-holy-grail.info                          ./the-holy-grail.info
                                             > ./victory.wav
./zzz.wad                                      ./zzz.wad

If it's not clear, "some-song.mp3" was only in the first directory while "victory.wav" was only in the second. The rest of the files were common.

Note that this only compares the file names, not the contents. If you like where this is headed, you could play with the diff options (maybe --suppress-common-lines if you want cleaner output).
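
For instance, adding that option to the same diff call should drop the matching lines and leave only the two differences (assuming the variables from the script above):

diff -y --suppress-common-lines <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")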

But this is probably how I'd approach it -- offload a lot of the work onto diff.

EDIT: I should also point out that something as simple as:

[user@host so]$ diff tmpa tmpb

would also work:

    Only in tmpa: some-song.mp3
    Only in tmpb: victory.wav

... but not feel as satisfying as writing a script yourself. :-)

Upvotes: 1
