bruchowski

Reputation: 5331

best way to find files recursively that have the same name but are actually different using bash?

I have about 15000 images in a nested file structure whose names are SKUs. I need to make sure that there are no files with the same SKU that are actually different files.

For example, if I have two or more files named MYSKU.jpg, I need to make sure that none of them are different from each other.

What's the best way to do that in a bash command?

Upvotes: 4

Views: 505

Answers (3)

kojiro

Reputation: 77099

Here's how I would tackle it with bash 4:

#!/usr/local/bin/bash -vx

shopt -s globstar # turn on recursive globbing
shopt -s nullglob # hide globs that don't match anything
shopt -s nocaseglob # match globs regardless of capitalization

images=( **/*.{gif,jpeg,jpg,png} ) # all the image files
declare -A homonyms # associative array of like named files

for i in "${!images[@]}"; do # iterate over indices
    base=${images[i]##*/} # file name without path
    homonyms["$base"]+="$i " # Space delimited list of indices for this basename
done

for base in "${!homonyms[@]}"; do # distinct basenames
    unset dupehashes; declare -A dupehashes # temporary var for hashes
    indices=( ${homonyms["$base"]} ) # omit quotes to allow expansion of space-delimited integers
    (( ${#indices[@]} > 1 )) || continue # ignore unique names
    for i in "${indices[@]}"; do
        dupehashes[$(md5 < "${images[i]}")]+="$i " # group indices by content hash (BSD/macOS md5; use md5sum on Linux)
    done

    (( ${#dupehashes[@]} > 1 )) || continue # ignore if same hash
    echo
    printf 'The following files have different hashes: '
    for h in "${!dupehashes[@]}"; do
        for i in ${dupehashes[$h]}; do # omit quotes to expand space-delimited integer list
            printf '%s %s\n' "$h" "${images[i]}"
        done
    done
done

I know the above looks like a lot, but I think with 15k images you really want to avoid open()ing and checksumming ones you don't have to, so this approach is tuned to narrow the dataset down to duplicate filenames and only then hash the contents. As others have said earlier, you can make this even faster by checking file sizes before hashing, but I'll leave that part unfinished.
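For what it's worth, here is a rough sketch of that size pre-check, meant to slot in at the top of the hashing loop above; it reuses the indices, images and base variables from that script, and stat -f%z is the BSD/macOS form (GNU coreutils would be stat -c%s):

unset sizes; declare -A sizes # distinct sizes among the files sharing this basename
for i in "${indices[@]}"; do
    sizes[$(stat -f%z "${images[i]}")]+="$i " # BSD/macOS stat; GNU: stat -c%s
done
if (( ${#sizes[@]} > 1 )); then
    printf 'Files named %s differ in size, no hashing needed\n' "$base"
    continue # skip the md5 step for this basename entirely
fi

Files that differ in size cannot be identical, so only the groups whose sizes all match ever get hashed.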

Upvotes: 0

Bechir

Reputation: 1051

The idea is to scan the directory for all files and report those that have the same name but different content, based on the md5 checksum.

#!/bin/bash

# directory to scan
scan_dir=$1

[ ! -d "$1" ] && echo "Usage $0 <scan dir>" && exit 1

# Associative array to save hash table
declare -A HASH_TABLE
# Associative array of full path of items
declare -A FULL_PATH


# read the file list NUL-delimited so names with spaces survive the loop
while IFS= read -r -d '' item ; do
    file=$(basename "$item")
    md5=$(md5sum "$item" | cut -f1 -d' ')
    if [ -z "${HASH_TABLE[$file]}" ] ; then
        HASH_TABLE[$file]=$md5
        FULL_PATH[$file]=$item
    else
        if [ "${HASH_TABLE[$file]}" != "$md5" ] ; then
            echo "differ $item from ${FULL_PATH[$file]}"
        fi
    fi
done < <(find "$scan_dir" -type f -print0)

Usage (assuming you saved the script as scan_dir.sh):

$ ./scan_dir.sh /path/to/your/directory
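The only output comes from the echo line in the script, so two conflicting copies of MYSKU.jpg from the question would produce something like this (the paths here are made up for illustration):

differ ./foo/MYSKU.jpg from ./bar/MYSKU.jpg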

Upvotes: 1

Pavel

Reputation: 7552

I don't want to solve the task for you completely, but here are some useful ingredients that you can try and integrate:

find /path -type f   # gives you a list of all files in /path

you can iterate through the list like this

find /path -type f -name '*.jpg' | while IFS= read -r f; do
  ...
done

now you can think of things you need to collect within the loop. I'd suggest

base=$(basename "$f")
full_path=$f
hash=$(md5sum "$f" | awk '{print $1}') # hash the file contents, not the file name

you can now store this information in three columns in a file, so that each line contains all you need to know about a file to find the duplicates.
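a minimal sketch of that step, assuming the hash goes in column 1, the basename in column 2 and the full path in column 3 (so the basename ends up in the second column, as assumed further down), and that none of the paths contain spaces:

find /path -type f -name '*.jpg' | while IFS= read -r f; do
  printf '%s %s %s\n' "$(md5sum "$f" | awk '{print $1}')" "$(basename "$f")" "$f"
done > list.txt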

since you didn't explain how you want to deal with the duplicates, here's just a suggestion for how to spot them; then it's up to you what to do with them.

given the list we obtained above, you can store two copies of it: one sorted by basename, the other sorted by basename with duplicate basenames removed:

sort -k2,2    list.txt | column -t > list.sorted.txt
sort -k2,2 -u list.txt | column -t > list.sorted.uniq.txt

here I assume the basename is in the second column

now run

diff list.sorted.txt list.sorted.uniq.txt

to see the files that have the same name. From each row you can then extract the MD5 checksum to verify whether they're really different, and the full path in order to perform some action like mv, rm, ln, etc.
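If you prefer to skip the diff, here is a small sketch that pulls the repeated basenames straight out of list.txt and prints their hashes and paths (again assuming the hash/basename/path column order from above):

awk '{print $2}' list.txt | sort | uniq -d | while IFS= read -r name; do
  echo "== $name"
  awk -v n="$name" '$2 == n {print $1, $3}' list.txt
done

Any block that shows more than one distinct hash for the same basename is a real conflict.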

Upvotes: 3
