Sambhav Sharma

Reputation: 5860

Remove duplicate files by filename in a directory (linux)

I have a directory structure like this

ARCHIVE_LOC -> epoch1 -> a.txt
                         b.txt

            -> epoch2 -> b.txt
                         c.txt

            -> epoch3 -> b.txt
                         c.txt

I have a base archive directory. This directory gets logs from an Android application via rsync (at regular intervals), and the logs are saved in subdirectories named after the epoch/timestamp of the rsync run. I want to remove all the duplicate log files (they have the same name) and keep only the latest ones. Any help on how to go about achieving this?

In a nutshell, I just want to keep the latest copy of every file. One way of knowing which copy is the latest is its size, since the size of a newer file will always be greater than or equal to that of the older one.
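For example, for a single name the biggest (and therefore latest) copy could be picked out with something like this (just a sketch, assuming GNU find; b.txt stands in for any log name):

find ARCHIVE_LOC -type f -name 'b.txt' -printf '%s %p\n' | sort -n | tail -1

Every other copy of b.txt that find lists would then be an older duplicate.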

Upvotes: 0

Views: 2185

Answers (3)

Per M.

Reputation: 95

On Debian 7, I managed to come up with the following one-liner:

find path/to/folder -type f -name '*.txt' -printf '%Ts\t%p\n' | sort -nr | cut -f2 | perl -ne '/(\w+\.txt)/; print if $seen{$&}++' | xargs rm

It is quite long and there may be shorter ways, but it seems to do the trick. I combined findings from here

https://superuser.com/questions/608887/how-can-i-make-find-find-files-in-reverse-chronological-order

and here

Perl regular expression removing duplicate consecutive substrings in a string
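Roughly, the pipeline works like this (the same idea written out step by step, assuming GNU find/coreutils and filenames without spaces):

find path/to/folder -type f -name '*.txt' -printf '%Ts\t%p\n' |  # mtime<TAB>path for every .txt
    sort -nr |                                                   # newest first
    cut -f2 |                                                    # drop the timestamp, keep the path
    perl -ne '/(\w+\.txt)/; print if $seen{$&}++' |              # print paths whose basename was already seen
    xargs rm                                                     # remove those older duplicates

Because the newest copy of each name comes first, the perl %seen hash skips the first occurrence of every basename and prints only the older copies, which xargs rm then deletes.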

Upvotes: 1

Aggarat .J

Reputation: 509

#!/bin/bash
# Flag duplicate files by content: remember every md5 checksum seen so far
# and print an "rm" line for each later file with the same checksum.
declare -A arr          # associative array: checksum -> number of times seen
shopt -s globstar       # make ** match files in all subdirectories

for file in **; do
    [[ -f "$file" ]] || continue        # skip directories
    read cksm _ < <(md5sum "$file")     # first field of md5sum output is the checksum
    if ((arr[$cksm]++)); then           # non-zero count means this content was seen before
        echo "rm $file"                 # only print; pipe the output to a shell to actually delete
    fi
done

https://superuser.com/questions/386199/how-to-remove-duplicated-files-in-a-directory
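Note that this flags duplicates by content (md5 checksum), not by filename, and it only prints the rm commands rather than running them. A possible way to use it, assuming the script is saved as dedupe-by-content.sh (the name is made up) and the filenames contain no spaces:

cd ARCHIVE_LOC
bash dedupe-by-content.sh          # dry run: lists the duplicate files it would remove
bash dedupe-by-content.sh | sh     # actually delete them (check the dry-run output first)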

Upvotes: 0

Sambhav Sharma

Reputation: 5860

Wrote the following script, works well for me.

#!/bin/bash

# check that the base directory provided exists
[ -e "$1" ] || {
    printf "\nError: invalid path.\n\n"
    exit 1
}

# optional second argument: pass -d to actually delete instead of just printing
option="$2"

# find all files under the base directory, sort the names and keep only those that
# occur more than once, then iterate over that list of duplicate names
# (add e.g. -name '*.json' to the find call to restrict this to certain log files)
for name in $(find "$1" -type f -printf "%f\n" | sort | uniq -d)
do
    # count the copies of this name so we know which one is the last (latest) copy
    numDups=$(find "$1" -name "$name" | wc -l)

    # sort the copies by size (ascending), so the biggest/latest one comes last
    for file in $(find "$1" -name "$name" -printf '%s %p\n' | sort -n | cut -d' ' -f2-)
    do
        if [ "$numDups" -ne 1 ]
        then
            if [ "$option" = -d ]   # remove the duplicate file
            then
                rm "$file"
            else
                echo "$file"        # if -d is not given, just print the duplicate file names
                # note: this prints only the older duplicates, not the latest/biggest copy
            fi
        fi
        numDups=$((numDups - 1))
        # note: the option check runs once per duplicate file; it could be hoisted out of
        # the loop, but that would duplicate code. Worth revisiting only if performance
        # becomes an issue.
    done
done

exit 0
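Example invocation, assuming the script above is saved as remove_dup_logs.sh (the name is arbitrary) and that -d is passed as the second argument, as in the script's option check:

bash remove_dup_logs.sh /path/to/ARCHIVE_LOC        # dry run: prints the older duplicate files
bash remove_dup_logs.sh /path/to/ARCHIVE_LOC -d     # actually delete them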

Upvotes: 1
