Reputation: 125
I hope you can help me with the following problem. I have 24 directories, each containing many (thousands of) files. I would like to find out which combination of directories contains the largest number of duplicate (by name only) files. For example, if we only consider 4 directories
dir1 dir2 dir3 dir4
with the following directory contents
dir1
1.fa 2.fa 3.fa 4.fa 5.fa
dir2
1.fa 10.fa 15.fa
dir3
1.fa 2.fa 3.fa
dir4
1.fa 2.fa 3.fa 5.fa 8.fa 10.fa
Therefore, the combination of directories dir1 and dir4 contains the most duplicate files (4).
The problem becomes quite large with 24 directories, so I was thinking that I might use a brute force approach, something along the lines of the pairwise comparison sketched below.
If anyone has a way of doing this I would be very grateful for some advice. I thought of using fdupes or diff, but I can't figure out how to parse the output and summarise it.
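To make the goal concrete, here is a minimal sketch of that pairwise count on the toy example above (the sets are hard-coded copies of the listings shown, not read from real directories):

from itertools import combinations

# Toy example from above: directory name -> set of file names
dirs = {
    "dir1": {"1.fa", "2.fa", "3.fa", "4.fa", "5.fa"},
    "dir2": {"1.fa", "10.fa", "15.fa"},
    "dir3": {"1.fa", "2.fa", "3.fa"},
    "dir4": {"1.fa", "2.fa", "3.fa", "5.fa", "8.fa", "10.fa"},
}

# Find the pair of directories sharing the most file names
best = max(combinations(dirs, 2), key=lambda pair: len(dirs[pair[0]] & dirs[pair[1]]))
print(best, len(dirs[best[0]] & dirs[best[1]]))   # ('dir1', 'dir4') 4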
Upvotes: 3
Views: 3259
Reputation: 1711
Just out of curiosity, I've done some simple tests: 24 directories with approximately 3900 files in each (a random number between 0 and 9999). Both bash scripts (in the other answers) take around 10 seconds each. Here is a basic Python script doing the same in ~0.2 s:
#!/usr/bin/env python
import sys, os

def get_max_duplicates(path):
    # pair every subdirectory of path with the set of file names it contains
    items = [(d, set(os.listdir(os.path.join(path, d))))
             for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    if len(items) < 2:
        # need at least two directories
        return ("", "", 0)
    # count shared names for every pair of directories and keep the largest
    values = [(items[i][0], items[j][0], len(items[i][1].intersection(items[j][1])))
              for i in range(len(items)) for j in range(i + 1, len(items))]
    return max(values, key=lambda a: a[2])

def main():
    path = sys.argv[1] if len(sys.argv) == 2 else os.getcwd()
    r = get_max_duplicates(path)
    print("%s and %s share %d files" % r)

if __name__ == '__main__':
    main()
As mentioned by Richard, by using a hash table (or a set in Python) we can speed things up. Intersecting two sets is O(min(len(set_a), len(set_b))), and we have to do N(N-1)/2 = 276 comparisons for N = 24 directories.
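The test data generator isn't shown here; a rough sketch of how such a tree might be created (the root name testdirs and the .fa extension are assumptions for illustration) could be:

#!/usr/bin/env python
# Hypothetical benchmark setup: 24 directories, each holding ~3900 empty
# files whose names are distinct random numbers between 0 and 9999.
import os, random

root = "testdirs"   # assumed root directory
for i in range(1, 25):
    d = os.path.join(root, "dir%d" % i)
    os.makedirs(d)
    for n in random.sample(range(10000), 3900):
        open(os.path.join(d, "%d.fa" % n), "w").close()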
Upvotes: 0
Reputation: 546
Could we create a hash table for all of these 24 directories? If the file names are just numbers, the hash function will be very easy to design.
If we can use a hash table, it will be faster to search for and find duplicates.
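This suggestion comes without code; one way to sketch the hash-table idea in Python (a dict mapping each file name to the directories that contain it, then tallying directory pairs) might be:

import os
from collections import Counter
from itertools import combinations

def most_shared_pair(root, dirs):
    # hash table: file name -> list of directories containing that name
    by_name = {}
    for d in dirs:
        for name in os.listdir(os.path.join(root, d)):
            by_name.setdefault(name, []).append(d)
    # a name held by k directories contributes to k*(k-1)/2 directory pairs
    pair_counts = Counter()
    for holders in by_name.values():
        for a, b in combinations(sorted(holders), 2):
            pair_counts[(a, b)] += 1
    return pair_counts.most_common(1)

# e.g. most_shared_pair("testdirs", ["dir1", "dir2", "dir3", "dir4"])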
Upvotes: 0
Reputation: 13707
./count_dups.sh:
1 files are duplicated comparing dir1 to dir2.
3 files are duplicated comparing dir1 to dir3.
4 files are duplicated comparing dir1 to dir4.
1 files are duplicated comparing dir2 to dir3.
2 files are duplicated comparing dir2 to dir4.
3 files are duplicated comparing dir3 to dir4.
./count_dups.sh | sort -n | tail -1
4 files are duplicated comparing dir1 to dir4.
Using the script count_dups.sh:
#!/bin/bash
# This assumes (among other things) that the dirs don't have spaces in the names
cd testdirs
declare -a DIRS=(`ls`);

function count_dups {
    DUPS=`ls $1 $2 | sort | uniq -d | wc -l`
    echo "$DUPS files are duplicated comparing $1 to $2."
}

LEFT=0
while [ $LEFT -lt ${#DIRS[@]} ] ; do
    RIGHT=$(( $LEFT + 1 ))
    while [ $RIGHT -lt ${#DIRS[@]} ] ; do
        count_dups ${DIRS[$LEFT]} ${DIRS[$RIGHT]}
        RIGHT=$(( $RIGHT + 1 ))
    done
    LEFT=$(( $LEFT + 1 ))
done
Upvotes: 0
Reputation: 74098
Count duplicate file names in shell:
#! /bin/sh
# directories to test for
dirs='dir1 dir2 dir3 dir4'

# directory pairs already seen
seen=''

for d1 in $dirs; do
    for d2 in $dirs; do
        # quote $seen so the leading space of the first recorded pair is kept
        if echo "$seen" | grep -q -e " $d1:$d2;" -e " $d2:$d1;"; then
            : # don't count twice
        elif test $d1 != $d2; then
            # remember pair of directories
            seen="$seen $d1:$d2;"
            # count duplicates
            ndups=`ls $d1 $d2 | sort | uniq -c | awk '$1 > 1' | wc -l`
            echo "$d1:$d2 $ndups"
        fi
    done
# sort decreasing and take the first
done | sort -k 2rn | head -1
Upvotes: 0
Reputation: 47367
I tagged your question with algorithm, as I am unaware of any existing bash / Linux tools that can directly solve this problem. The easiest way would be to construct an algorithm for this in a programming language such as Python, C++, or Java rather than in a shell script.
That being said, here's a high-level analysis of your problem: at first glance it looks like a minimum set cover problem, but it actually breaks down into two parts:
Part 1 - What is the set of files to cover?
You want to find the combination of directories that covers the largest number of duplicate files. But first you need to know what the maximum set of duplicate files within your 24 directories is.
Since the intersection of the files in two directories is always greater than or equal to their intersection with a third directory, you go through all pairs of directories and find the maximum intersection set:
(24 choose 2) = 276 comparisons
You take the largest intersection set found and use that as the set you are actually trying to cover.
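A short sketch of Part 1, assuming the listings have already been loaded as Python sets of file names (the helper name and its argument are illustrative):

from itertools import combinations

def largest_pairwise_intersection(name_sets):
    # name_sets: dict mapping directory name -> set of file names
    best = set()
    for a, b in combinations(name_sets, 2):
        common = name_sets[a] & name_sets[b]
        if len(common) > len(best):
            best = common
    return best  # the duplicate set you then try to cover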
Part 2 - The minimum set cover problem
This is a well-studied problem in computer science, so you are better served by reading the writings of people much smarter than I am.
The only thing I will note is that it is an NP-complete problem, so it's not trivial.
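For illustration only, the standard greedy approximation (repeatedly pick the directory that covers the most still-uncovered names) looks roughly like this; it yields an approximate cover, not a guaranteed minimum:

def greedy_set_cover(target, name_sets):
    # target: set of file names to cover; name_sets: dir name -> set of names
    # classic greedy approximation, not an exact minimum cover
    uncovered = set(target)
    chosen = []
    while uncovered:
        best = max(name_sets, key=lambda d: len(name_sets[d] & uncovered))
        gained = name_sets[best] & uncovered
        if not gained:
            break  # remaining names appear in no directory
        chosen.append(best)
        uncovered -= gained
    return chosen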
This is the best I can do to address the original formulation of your question, but I have a feeling that it's overkill for what you actually need to accomplish. You should consider updating your question with the actual problem that you need to solve.
Upvotes: 4