Camsoft

Reputation: 12005

How to find duplicate files with the same name but in different case in the same directory in Linux?

How can I return a list of files that are name duplicates, i.e. have the same name but in different case, and exist in the same directory?

I don't care about the contents of the files. I just need to know the location and name of any files that have a duplicate of the same name.

Example duplicates:

/www/images/taxi.jpg
/www/images/Taxi.jpg

Ideally I need to search all files recursively from a base directory. In the above example it was /www/.

Upvotes: 35

Views: 48559

Answers (11)

fedorqui

Reputation: 289595

You can check duplicates in a given directory with GNU awk:

gawk 'BEGINFILE {if ((seen[tolower(FILENAME)]++)) print FILENAME; nextfile}' *

This uses BEGINFILE to perform an action before reading each file. In this case, it keeps track of the names that have appeared in an array seen[], whose indexes are the lowercased file names.

If a name has already appeared, no matter its case, it prints it. Otherwise, it just jumps to the next file.


See an example:

$ tree
.
├── bye.txt
├── hello.txt
├── helLo.txt
├── yeah.txt
└── YEAH.txt

0 directories, 5 files
$ gawk 'BEGINFILE {if ((seen[tolower(FILENAME)]++)) print FILENAME; nextfile}' *
helLo.txt
YEAH.txt
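
This handles one directory at a time. For the recursive case in the question, a possible sketch (my addition, not part of the original answer) is to key the array on the directory plus the lowercased basename:

find . -type f -exec gawk 'BEGINFILE {
    n = split(FILENAME, parts, "/")
    # key = directory part of the path + lowercased basename
    key = substr(FILENAME, 1, length(FILENAME) - length(parts[n])) tolower(parts[n])
    if (seen[key]++) print FILENAME    # second or later spelling in the same directory
    nextfile
}' {} +

Note that if the file list is long enough for find to split it across several gawk invocations, the seen[] array is reset between them, so very large trees may need a different approach.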

Upvotes: 0

user3119102

Reputation:

You can use:

find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | sort | uniq -c

Where:

  • find -type f
    recursively prints every file's path.

  • -exec readlink -m {} \;
    resolves each one to an absolute, canonical path.

  • gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}'
    lowercases just the filename (the last path component).

  • sort | uniq -c
    groups identical lowercased paths and prints how many times each occurs; a count above 1 marks a duplicate (see the refined pipeline below).
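
To show only the duplicated names rather than every count, one refinement (my sketch, not part of the original answer) is to filter on the count:

find -type f -exec readlink -m {} \; \
  | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' \
  | sort \
  | uniq -c \
  | awk '$1 > 1'    # keep only lowercased paths that occur more than once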

Upvotes: 1

Bah

Reputation: 1

I just used fdupes on CentOS to clean up a whole bunch of duplicate files...

yum install fdupes

Upvotes: -2

noclayto

Reputation: 160

Here is an example of how to find all duplicate jar files:

find . -type f -name "*.jar" -printf "%f\n" | sort -f | uniq -i -d

Replace *.jar with whatever duplicate file type you are looking for.
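
If you then need the full paths for each duplicated name, a follow-up sketch (my addition, building on the command above; names containing glob characters would need escaping):

find . -type f -name "*.jar" -printf "%f\n" | sort -f | uniq -i -d |
while IFS= read -r name; do
    find . -type f -iname "$name"    # list every path whose basename matches, in any case
done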

Upvotes: 2

The other answer is great, but instead of the "rather monstrous" perl script I suggest

perl -pe 's!([^/]+)$!lc $1!e'

Which will lowercase just the filename part of the path.

Edit 1: In fact the entire problem can be solved with:

find . | perl -ne 's!([^/]+)$!lc $1!e; print if 1 == $seen{$_}++'

Edit 3: I found a solution using sed, sort and uniq that will also print out the duplicates, but it only works if there is no whitespace in the filenames:

find . |sed 's,\(.*\)/\(.*\)$,\1/\2\t\1/\L\2,'|sort|uniq -D -f 1|cut -f 1

Edit 2: And here is a longer script that will print out the names; it takes a list of paths on stdin, as given by find. Not so elegant, but still:

#!/usr/bin/perl -w

use strict;
use warnings;

my %dup_series_per_dir;
while (<>) {
    my ($dir, $file) = m!(.*/)?([^/]+?)$!;    # split each path into directory and filename
    push @{$dup_series_per_dir{$dir||'./'}{lc $file}}, $file;    # group by directory, then by lowercased name
}

for my $dir (sort keys %dup_series_per_dir) {
    # keep only the groups that contain more than one spelling of the same name
    my @all_dup_series_in_dir = grep { @{$_} > 1 } values %{$dup_series_per_dir{$dir}};
    for my $one_dup_series (@all_dup_series_in_dir) {
        print "$dir\{" . join(',', sort @{$one_dup_series}) . "}\n";
    }
}
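
Assuming the script above is saved as find_case_dups.pl (a hypothetical name), it can be fed the output of find; for the taxi example from the question it would print something like:

find /www -type f | perl find_case_dups.pl
/www/images/{Taxi.jpg,taxi.jpg}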

Upvotes: 44

serg10

Reputation: 32667

Little bit late to this one, but here's the version I went with:

find . -type f | awk -F/ '{print $NF}' | sort -f | uniq -i -d

Here we are using:

  1. find - find all files under the current dir
  2. awk - strip the path, leaving just the filename
  3. sort - sort case insensitively
  4. uniq - find the dupes from what makes it through the pipe (see the example run below)
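
For the question's /www example this would look roughly as follows (which spelling uniq keeps as the representative is not guaranteed). Keep in mind that only basenames are compared, so same-named files in different directories are reported too:

find /www -type f | awk -F/ '{print $NF}' | sort -f | uniq -i -d
taxi.jpg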

(Inspired by @mpez0 answer, and @SimonDowdles comment on @paxdiablo answer.)

Upvotes: 0

crafter

Reputation: 6296

Here's a script that worked for me (I am not the author). The original and discussion can be found here: http://www.daemonforums.org/showthread.php?t=4661

#! /bin/sh

# find duplicated files in directory tree
# comparing by file NAME, SIZE or MD5 checksum
# --------------------------------------------
# LICENSE(s): BSD / CDDL
# --------------------------------------------
# vermaden [AT] interia [DOT] pl
# http://strony.toya.net.pl/~vermaden/links.htm

__usage() {
  echo "usage: $( basename ${0} ) OPTION DIRECTORY"
  echo "  OPTIONS: -n   check by name (fast)"
  echo "           -s   check by size (medium)"
  echo "           -m   check by md5  (slow)"
  echo "           -N   same as '-n' but with delete instructions printed"
  echo "           -S   same as '-s' but with delete instructions printed"
  echo "           -M   same as '-m' but with delete instructions printed"
  echo "  EXAMPLE: $( basename ${0} ) -s /mnt"
  exit 1
  }

__prefix() {
  case $( id -u ) in
    (0) PREFIX="rm -rf" ;;
    (*) case $( uname ) in
          (SunOS) PREFIX="pfexec rm -rf" ;;
          (*)     PREFIX="sudo rm -rf"   ;;
        esac
        ;;
  esac
  }

__crossplatform() {
  case $( uname ) in
    (FreeBSD)
      MD5="md5 -r"
      STAT="stat -f %z"
      ;;
    (Linux)
      MD5="md5sum"
      STAT="stat -c %s"
      ;;
    (SunOS)
      echo "INFO: supported systems: FreeBSD Linux"
      echo
      echo "Porting to Solaris/OpenSolaris"
      echo "  -- provide values for MD5/STAT in '$( basename ${0} ):__crossplatform()'"
      echo "  -- use digest(1) instead for md5 sum calculation"
      echo "       $ digest -a md5 file"
      echo "  -- pfexec(1) is already used in '$( basename ${0} ):__prefix()'"
      echo
      exit 1
      ;;
    (*)
      echo "INFO: supported systems: FreeBSD Linux"
      exit 1
      ;;
  esac
  }

__md5() {
  __crossplatform
  :> ${DUPLICATES_FILE}
  DATA=$( find "${1}" -type f -exec ${MD5} {} ';' | sort -n )
  echo "${DATA}" \
    | awk '{print $1}' \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
      done

  echo "${DATA}" \
    | awk '{print $1}' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "count: ${COUNT} | md5: ${SUM}"
        grep ${SUM} ${DUPLICATES_FILE} \
          | cut -d ' ' -f 2-10000 2> /dev/null \
          | while read LINE
            do
              if [ -n "${PREFIX}" ]
              then
                echo "  ${PREFIX} \"${LINE}\""
              else
                echo "  ${LINE}"
              fi
            done
        echo
      done
  rm -rf ${DUPLICATES_FILE}
  }

__size() {
  __crossplatform
  find "${1}" -type f -exec ${STAT} {} ';' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SIZE=$( echo ${LINE} | awk '{print $2}' )
        SIZE_KB=$( echo ${SIZE} / 1024 | bc )
        echo "count: ${COUNT} | size: ${SIZE_KB}KB (${SIZE} bytes)"
        if [ -n "${PREFIX}" ]
        then
          find ${1} -type f -size ${SIZE}c -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          # find ${1} -type f -size ${SIZE}c -exec echo "  {}  " ';'  -exec du -h "  {}" ';'
          find ${1} -type f -size ${SIZE}c -exec echo "  {}  " ';'
        fi
        echo
      done
  }

__file() {
  __crossplatform
  find "${1}" -type f \
    | xargs -n 1 basename 2> /dev/null \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && break
        FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
        echo "count: ${COUNT} | file: ${FILE}"
        FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
        if [ -n "${PREFIX}" ]
        then
          find ${1} -iname "${FILE}" -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -iname "${FILE}" -exec echo "  {}" ';'
        fi
        echo
      done 
  }

# main()

[ ${#} -ne 2  ] && __usage
[ ! -d "${2}" ] && __usage

DUPLICATES_FILE="/tmp/$( basename ${0} )_DUPLICATES_FILE.tmp"

case ${1} in
  (-n)           __file "${2}" ;;
  (-m)           __md5  "${2}" ;;
  (-s)           __size "${2}" ;;
  (-N) __prefix; __file "${2}" ;;
  (-M) __prefix; __md5  "${2}" ;;
  (-S) __prefix; __size "${2}" ;;
  (*)  __usage ;;
esac

If the find command is not working for you, you may have to change it. For example

OLD :   find "${1}" -type f | xargs -n 1 basename 
NEW :   find "${1}" -type f -printf "%f\n"
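
Assuming the script is saved as dup.sh (a hypothetical name), typical invocations would be:

sh dup.sh -n /www    # report duplicate file names under /www
sh dup.sh -N /www    # same, but also print ready-to-run delete commands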

Upvotes: 1

Alain

Reputation: 41

Following up on the response of mpez0, to detect duplicates recursively just replace "ls" with "find .". The only problem I see with this is that if a directory name is duplicated, you will get one entry for each file inside that directory, so some human judgment is needed to interpret the output.

But anyway, you're not automatically deleting these files, are you?

find . | sort -f | uniq -i -d

Upvotes: 4

user1639307

Reputation: 21

findsn is a nice little command-line app that you get if you compile fslint; the deb package does not include it.

It will find any files with the same name, it's lightning fast, and it can handle differences in case.

/findsn --help
find (files) with duplicate or conflicting names.
Usage: findsn [-A -c -C] [[-r] [-f] paths(s) ...]

If no arguments are supplied the $PATH is searched for any redundant or conflicting files.

-A  reports all aliases (soft and hard links) to files.
    If no path(s) specified then the $PATH is searched.

If only path(s) are specified then they are checked for duplicate named files. You can qualify this with -C to ignore case in this search. Qualifying with -c is more restrictive, as only files (or directories) in the same directory whose names differ only in case are reported. I.e. -c will flag files & directories that will conflict if transferred to a case-insensitive file system. Note that if -c or -C is specified and no path(s) are given, the current directory is assumed.
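
Based on that description, a usage sketch for the question's case (same directory, names differing only in case), assuming findsn ends up on your PATH after building fslint:

findsn -c /www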

Upvotes: 2

mpez0

Reputation: 2883

I believe

ls | sort -f | uniq -i -d

is simpler, faster, and will give the same result.

Upvotes: 5

paxdiablo

Reputation: 881353

Try:

ls -1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "

Simple, really :-) Aren't pipelines wonderful beasts?

The ls -1 gives you the files one per line, the tr '[A-Z]' '[a-z]' converts all uppercase to lowercase, the sort sorts them (surprisingly enough), uniq -c removes subsequent occurrences of duplicate lines whilst giving you a count as well and, finally, the grep -v " 1 " strips out those lines where the count was one.

When I run this in a directory with one "duplicate" (I copied qq to qQ), I get:

2 qq

For the "this directory and every subdirectory" version, just replace ls -1 with find . or find DIRNAME if you want a specific directory starting point (DIRNAME is the directory name you want to use).

This returns (for me):

2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3/%gconf.xml
2 ./.gnome2/accels/blackjack
2 ./qq

which are caused by:

pax> ls -1d .gnome2/accels/[bB]* .gconf/system/gstreamer/0.10/audio/profiles/[mM]* [qQ]?
.gconf/system/gstreamer/0.10/audio/profiles/mp3
.gconf/system/gstreamer/0.10/audio/profiles/MP3
.gnome2/accels/blackjack
.gnome2/accels/Blackjack
qq
qQ

Update:

Actually, on further reflection, the tr will lowercase all components of the path so that both of

/a/b/c
/a/B/c

will be considered duplicates even though they're in different directories.

If you only want duplicates within a single directory to show as a match, you can use the (rather monstrous):

perl -ne '
    chomp;
    @flds = split (/\//);
    $lstf = $flds[-1];              # last path component (the filename)
    $lstf =~ tr/A-Z/a-z/;           # lowercase only the filename
    for ($i = 0; $i < $#flds; $i++) {
        print "$flds[$i]/";         # re-emit the directory components unchanged
    };
    print "$lstf\n";'

in place of:

tr '[A-Z]' '[a-z]'

What it does is to only lowercase the final portion of the pathname rather than the whole thing. In addition, if you only want regular files (no directories, FIFOs and so forth), use find -type f to restrict what's returned.
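
Putting the pieces together (a sketch, using the shorter per-filename lowercasing suggested earlier in this thread instead of the longer script above):

find . -type f | perl -pe 's!([^/]+)$!lc $1!e' | sort | uniq -c | grep -v " 1 "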

Upvotes: 37
