user1763508

Find all duplicate subdirectories in directory

I need to make a shell script that "lists all identical sub-directories (recursively) under the current working directory."

I'm new to shell scripts. How do I approach this?

To me, this means comparing every sub-directory tree under the current directory against every other one.

It would have been the most complicated program I'd ever written, so I assume I'm just not aware of some shell command that does most of it for me?

I.e., how should I have approached this? All the other parts were about googling until I discovered the shell command that did 90% of it for me.

(This was for a previous assignment that I wasn't able to finish; I took a zero on this part, but I need to know how to approach it in the future.)

Upvotes: 0

Views: 493

Answers (2)

James Brown

Reputation: 37404

Maybe something like this:

$ find -type d -exec sh -c "echo -n {}\  ; sh -c \"ls -s {}; basename {}\"|md5sum " \; | awk '$2 in a {print "Match:"; print a[$2], $1; next} a[$2]=$1{next}'
Match:
./bar/foo ./foo

find all directories: find -type d, output:

.
./bar
./bar/foo
./foo

ls -s {}; basename {} will print the simplified directory listing and the basename of the directory listed; for example, for directory foo: ls -s foo; basename foo

total 0
0 test
foo

Those cover the files in each dir, their sizes, and the dir name. That output is piped to md5sum, and the resulting hash is printed next to the dir:

. 674e2573b49826d4e32dfe81d9680369  -
./bar 4c2d588c5fa9781ad63ad8e86e575e01  -
./bar/foo ff8d1569685be86366f18ea89851db35  -
./foo ff8d1569685be86366f18ea89851db35  -

will be sent to awk:

$2 in a {            # hash already seen: the dirs match
    print "Match:"   # separate hits in the output
    print a[$2], $1  # print the matching dirs
    next             # move on to the next record
}
a[$2]=$1 {next}      # otherwise remember the dir; only the first dir per hash is stored
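
To test the awk part in isolation, you can feed it the sample lines from above:

$ printf '%s\n' \
    '. 674e2573b49826d4e32dfe81d9680369  -' \
    './bar 4c2d588c5fa9781ad63ad8e86e575e01  -' \
    './bar/foo ff8d1569685be86366f18ea89851db35  -' \
    './foo ff8d1569685be86366f18ea89851db35  -' |
  awk '$2 in a {print "Match:"; print a[$2], $1; next} a[$2]=$1 {next}'
Match:
./bar/foo ./foo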

Test dir structure:

$ mkdir -p test/foo; mkdir -p test/bar/foo; touch test/foo/test; touch test/bar/foo/test
$ find test/
test/
test/bar
test/bar/foo
test/bar/foo/test  # touch test
test/foo
test/foo/test      # touch test
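
For readability, the same pipeline can be laid out over several lines (behavior unchanged; it assumes GNU find, which allows omitting the start path):

find -type d -exec sh -c "
      echo -n {}\  ;
      sh -c \"ls -s {}; basename {}\" | md5sum
    " \; |
  awk '$2 in a { print "Match:"; print a[$2], $1; next }
       a[$2]=$1 { next }'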

Upvotes: 1

Alfe

Reputation: 59426

I'd be surprised to hear that there is a special Unix tool, or a special usage of a standard Unix tool, that does exactly what you describe. Maybe your understanding of the task is more complex than what the task giver intended. Maybe "identical" was meant in the sense of linking; but normally, hardlinking directories is not allowed, so this probably isn't meant either.

Anyway, I'd approach this task by creating checksums for all nodes in your tree, i. e. recursively:

  • For a directory take the names of all entries and their checksums (recursion) and compute a checksum of them,
  • for a plain file compute a checksum of its contents,
  • for symlinks and special files (devices, etc.) consider what you want (I'll leave this out).

After creating checksums for all elements, search for duplicates (by sorting a list of all and searching for consecutive lines).

A quick solution could be like this:

#!/bin/bash

dirchecksum() {
  if [ -f "$1" ]
  then
    # plain file: checksum of its contents
    checksum=$(md5sum < "$1")
  elif [ -d "$1" ]
  then
    # directory: checksum over the names and checksums of its entries
    # (note: this relies on find listing identical dirs in the same order)
    checksum=$(
      find "$1" -mindepth 1 -maxdepth 1 -printf "%P " \
           -exec bash -c 'dirchecksum "$1"' _ {} \; |
        md5sum
    )
  fi
  echo "$checksum"           # return the checksum to the caller (fd 1)
  echo "$checksum $1" 1>&3   # log every checksum with its path (fd 3)
}
export -f dirchecksum

# capture the fd 3 log, discard the fd 1 result of the top-level call
list=$(dirchecksum "$1" 3>&1 1>/dev/null)

lastChecksum=''
while read checksum _ path    # md5sum prints "hash  -", so skip the "-"
do
  if [ "$checksum" = "$lastChecksum" ]
  then
    echo "duplicate found: $path = $lastPath"
  fi
  lastChecksum=$checksum
  lastPath=$path
done < <(sort <<< "$list")
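
A quick run against the test tree from the first answer might look like this (dupdirs.sh is just an assumed file name for the script above):

$ bash dupdirs.sh test
duplicate found: test/foo = test/bar/foo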

This script uses two tricks which might not be clear, so I mention them:

  • To pass a shell function to find -exec, one can export -f it (done right after its definition) and then call bash -c 'dirchecksum "$1"' _ {} to execute it.
  • The shell function has two output streams: one for returning the resulting checksum to its caller (stdout, i. e. fd 1), and one for logging every checksum computed along the way (fd 3).

The sorting at the end uses the list given out via fd 3 as input.
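
Here is a minimal standalone sketch of both tricks (the function names greet and demo are made up for illustration):

# Trick 1: an exported function is visible to a child bash started by find.
greet() { echo "hello from $1"; }
export -f greet
find . -maxdepth 0 -exec bash -c 'greet "$1"' _ {} \;   # prints: hello from .

# Trick 2: two output streams on fd 1 and fd 3.
demo() {
  echo "result"              # fd 1 carries the function's "return value"
  echo "intermediate" 1>&3   # fd 3 carries the side-channel log
}
out=$(demo 3>&1 1>/dev/null)  # capture fd 3, discard fd 1
echo "$out"                   # prints: intermediate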

Upvotes: 1
