Reputation: 5331
I have about 15,000 images in a nested file structure whose names are SKUs. I need to make sure that there are no files with the same SKU that are actually different files.
For example, if I have two or more files named MYSKU.jpg, I need to make sure that none of them differ from each other.
What's the best way to do that in a bash command?
Upvotes: 4
Views: 505
Reputation: 77099
Here's how I would tackle it with bash 4:
#!/usr/local/bin/bash -vx

shopt -s globstar    # turn on recursive globbing
shopt -s nullglob    # hide globs that don't match anything
shopt -s nocaseglob  # match globs regardless of capitalization

images=( **/*.{gif,jpeg,jpg,png} )    # all the image files
declare -A homonyms                   # associative array of like-named files

for i in "${!images[@]}"; do          # iterate over indices
    base=${images[i]##*/}             # file name without path
    homonyms["$base"]+="$i "          # space-delimited list of indices for this basename
done

for base in "${!homonyms[@]}"; do     # distinct basenames
    unset dupehashes; declare -A dupehashes   # temporary var for hashes
    indices=( ${homonyms["$base"]} )  # omit quotes to allow expansion of space-delimited integers
    (( ${#indices[@]} > 1 )) || continue      # ignore unique names

    for i in "${indices[@]}"; do
        dupehashes[$(md5 < "${images[i]}")]+="$i "   # BSD/macOS md5; use md5sum on GNU/Linux
    done

    (( ${#dupehashes[@]} > 1 )) || continue   # ignore if every copy has the same hash
    echo
    printf 'The following files have different hashes:\n'
    for h in "${!dupehashes[@]}"; do
        for i in ${dupehashes[$h]}; do        # omit quotes to expand space-delimited integer list
            printf '%s %s\n' "$h" "${images[i]}"
        done
    done
done
I know the above looks like a lot, but I think with 15k images you really want to avoid open()ing and checksumming files you don't have to, so this approach is tuned to narrow the dataset down to duplicate filenames and only then hash the contents. As others have said, you can make this even faster by checking file sizes before hashing, but I'll leave that part unfinished.
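For what it's worth, here is a minimal, hypothetical sketch of that size pre-check (my addition, not part of the answer above): a small helper that succeeds only when every file passed to it has the same byte size, so the per-basename loop only needs to hash when the sizes already agree.
# Hypothetical size pre-check; the call site below assumes the indices/images/base
# variables from the per-basename loop above.
# same_size FILE...  ->  succeeds only if every file has the same byte size.
same_size() {
    local distinct
    distinct=$(
        for f in "$@"; do
            stat -f '%z' "$f"    # BSD/macOS stat; use 'stat -c %s' on GNU/Linux
        done | sort -u | wc -l
    )
    (( distinct == 1 ))
}
# Possible call site, inside the per-basename loop above:
#   files=(); for i in "${indices[@]}"; do files+=( "${images[i]}" ); done
#   same_size "${files[@]}" || { echo "$base: sizes differ, so contents differ"; continue; }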
Upvotes: 0
Reputation: 1051
The idea is to scan the directory for all files and check which ones have the same name but different content, based on the MD5 checksum:
#!/bin/bash

# directory to scan
scan_dir=$1
[ ! -d "$scan_dir" ] && echo "Usage: $0 <scan dir>" && exit 1

# Associative array mapping basename -> MD5 checksum of the first file seen
declare -A HASH_TABLE
# Associative array mapping basename -> full path of the first file seen
declare -A FULL_PATH

# read NUL-delimited paths so names containing spaces survive
while IFS= read -r -d '' item ; do
    file=$(basename "$item")
    md5=$(md5sum "$item" | cut -f1 -d' ')
    if [ -z "${HASH_TABLE[$file]}" ] ; then
        HASH_TABLE[$file]=$md5
        FULL_PATH[$file]=$item
    else
        if [ "${HASH_TABLE[$file]}" != "$md5" ] ; then
            echo "differ $item from ${FULL_PATH[$file]}"
        fi
    fi
done < <(find "$scan_dir" -type f -print0)
Usage (assuming you save the script as scan_dir.sh):
$ ./scan_dir.sh /path/to/your/directory
Upvotes: 1
Reputation: 7552
I don't want to solve the task for you completely, but here are some useful ingredients that you can try and integrate:
find /path -type f # gives you a list of all files in /path
You can iterate through the list like this (a while/read loop avoids breaking on paths that contain spaces):
find /path -type f -name '*.jpg' | while IFS= read -r f; do
    ...
done
Now you can think of the things you need to collect within the loop. I'd suggest:
base=$(basename "$f")
full_path=$f
hash=$(md5sum "$f" | awk '{print $1}')   # hash the file contents, not the file name
You can now store this information in three columns in a file, so that each line contains everything you need to know about a file in order to find the duplicates.
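A minimal sketch of that step, assuming GNU md5sum and a list.txt output file chosen here purely for illustration (the basename is deliberately written to the second column, which the sort commands below rely on):
# Hypothetical sketch: one line per file as "hash basename full_path".
find /path -type f -name '*.jpg' | while IFS= read -r f; do
    base=$(basename "$f")
    hash=$(md5sum "$f" | awk '{print $1}')
    printf '%s %s %s\n' "$hash" "$base" "$f"
done > list.txt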
Since you didn't explain how you want to deal with the duplicates, here's just a suggestion for how to spot them; it's then up to you what to do with them.
Given the list we obtained above, you can store two copies of it: one sorted by basename, the other sorted by basename with duplicates removed:
sort -k2 list.txt | column -t > list.sorted.txt
sort -k2 -u list.txt | column -t > list.sorted.uniq.txt
Here I assume the basename is in the second column. Now run
diff list.sorted.txt list.sorted.uniq.txt
to see the files that have the same name. From each row you can then extract the MD5 checksum to verify whether they're really different, and also the full path in order to perform some action like mv, rm, ln, etc.
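For illustration only, assuming the hash/basename/path column order from the sketch above, the extra same-named entries can be pulled back out of the diff output like this:
# Hypothetical follow-up: lines starting with "<" exist only in list.sorted.txt,
# i.e. additional files sharing a basename; print their MD5 and full path.
diff list.sorted.txt list.sorted.uniq.txt | awk '/^</ {print $2, $4}'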
Upvotes: 3