Reputation: 5860
I have a directory structure like this
ARCHIVE_LOC -> epoch1 -> a.txt
                         b.txt
            -> epoch2 -> b.txt
                         c.txt
            -> epoch3 -> b.txt
                         c.txt
I have a base archive directory. This directory gets logs from an Android application via rsync (at regular intervals), and the logs are saved in directories named after the epoch/timestamp of the rsync run. I want to remove all the duplicate log files (they have the same name) and keep only the latest ones. Any help on how to go about achieving this?
In a nutshell, I just want to keep the latest version of every file. One way of knowing which file is the latest is its size, since the size of the newer file will always be greater than or equal to that of the older file. For the example above, that means keeping epoch1/a.txt, epoch3/b.txt and epoch3/c.txt and removing the older copies of b.txt and c.txt.
Upvotes: 0
Views: 2185
Reputation: 95
On Debian 7, I managed to come up with the following one-liner:
find path/to/folder -type f -name '*.txt' -printf '%Ts\t%p\n' | sort -nr | cut -f2 | perl -ne '/(\w+\.txt)/; print if $seen{$&}++' | xargs rm
It is quite long and there may be shorter ways, but it seems to do the trick. I combined findings from here
https://superuser.com/questions/608887/how-can-i-make-find-find-files-in-reverse-chronological-order
and here
Perl regular expression removing duplicate consecutive substrings in a string
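For readability, here is the same pipeline split over several lines, as a sketch with a dry run at the end (it prints the rm commands instead of running them; drop the echo once the output looks right):
# list every .txt file with its modification time, newest first,
# skip the first (newest) occurrence of each file name, and
# print an rm command for every older copy instead of deleting it
find path/to/folder -type f -name '*.txt' -printf '%Ts\t%p\n' \
    | sort -nr \
    | cut -f2 \
    | perl -ne '/(\w+\.txt)/; print if $seen{$&}++' \
    | xargs echo rm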
Upvotes: 1
Reputation: 509
#!/bin/bash
# remember which checksums have already been seen
declare -A arr
shopt -s globstar

for file in **; do
    [[ -f "$file" ]] || continue
    # first field of md5sum output is the checksum
    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        # a file with this content has already been seen, so this one is a duplicate
        echo "rm $file"
    fi
done
https://superuser.com/questions/386199/how-to-remove-duplicated-files-in-a-directory
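Note that ** expands in alphabetical order, so the copy that survives is simply the first one encountered, not necessarily the newest. A possible variation (just a sketch, assuming GNU find and bash 4 or later) is to feed the files to the loop newest-first, so the most recent copy of each duplicate is the one that gets kept:
#!/bin/bash
declare -A arr
# walk the files newest-first so the first copy seen (and kept) is the latest one
while read -r file; do
    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file" # older duplicate; switch the echo to a real rm once verified
    fi
done < <(find . -type f -printf '%Ts\t%p\n' | sort -nr | cut -f2-)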
Upvotes: 0
Reputation: 5860
I wrote the following script; it works well for me.
#!/bin/bash

# check that the base directory provided exists
[ -e "$1" ] || {
    printf "\nError: invalid path.\n\n"
    exit 1
}

option="$2" # optional -d flag: actually delete the duplicates instead of just printing them

# find the file names in the base directory, sort them, keep only the duplicated names,
# and iterate over the resulting list of names
for name in $(find "$1" -type f -printf "%f\n" | sort | uniq -d)
do
    # count the copies of this name so we know when we reach the last (biggest) file
    numDups=$(find "$1" -type f -name "$name" | wc -l)
    # sort the copies by size, smallest first, so the biggest/latest copy comes last
    for file in $(find "$1" -type f -name "$name" -printf "%s %p\n" | sort -n | cut -d' ' -f2-)
    do
        if [ "$numDups" -ne 1 ]
        then
            if [ "$option" = -d ] # remove the duplicate file
            then
                rm "$file"
            else
                echo "$file" # if -d is not provided, just print the duplicate file names
                # note: this prints only the duplicate files, not the latest/biggest file
            fi
        fi
        numDups=$((numDups - 1))
        # note: as per the current code, we check the option's value for each duplicate file
        # we could move the if conditions out of the for loop, but that would duplicate code
        # we may modify the script later if we see serious performance issues
    done
done

exit 0
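Example usage, assuming the script is saved as dedup.sh (the name is just for illustration) and made executable:
# dry run: print the older duplicate files that would be removed
./dedup.sh ARCHIVE_LOC

# pass -d as the second argument to actually delete them
./dedup.sh ARCHIVE_LOC -d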
Upvotes: 1