Reputation: 125
I hope you can help me with the following problem. I have 24 directories, each containing many (thousands of) files. I would like to find out which combination of directories contains the largest number of duplicate (by name only) files. For example, if we only consider 4 directories
dir1 dir2 dir3 dir4
with the following directory contents
dir1
1.fa 2.fa 3.fa 4.fa 5.fa
dir2
1.fa 10.fa 15.fa
dir3
1.fa 2.fa 3.fa
dir4
1.fa 2.fa 3.fa 5.fa 8.fa 10.fa
Therefore, the combination of directories dir1 and dir4 contains the most duplicate files (4).
The problem becomes quite large with 24 directories, so I was thinking that I might use a brute force approach, something along the lines of the pairwise comparison sketched below.
If anyone has a way of doing this I would be very grateful for some advice. I thought of using fdupes or diff, but I can't figure out how to parse the output and summarise it.
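To make the goal concrete, here is a minimal sketch of that pairwise count on the toy example above (the sets are hard-coded copies of the listings shown, not read from real directories):

from itertools import combinations

# Toy example from above: directory name -> set of file names
dirs = {
    "dir1": {"1.fa", "2.fa", "3.fa", "4.fa", "5.fa"},
    "dir2": {"1.fa", "10.fa", "15.fa"},
    "dir3": {"1.fa", "2.fa", "3.fa"},
    "dir4": {"1.fa", "2.fa", "3.fa", "5.fa", "8.fa", "10.fa"},
}

# Find the pair of directories sharing the most file names
best = max(combinations(dirs, 2), key=lambda pair: len(dirs[pair[0]] & dirs[pair[1]]))
print(best, len(dirs[best[0]] & dirs[best[1]]))   # ('dir1', 'dir4') 4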
Upvotes: 3
Views: 3259
Reputation: 1711
Just out of curiosity, I've done some simple tests: 24 directories with approximately 3900 files in each (a random number between 0 and 9999). Both bash scripts (in the other answers) take around 10 seconds each. Here is a basic Python script doing the same in ~0.2 s:
#!/usr/bin/env python
import sys, os

def get_max_duplicates(path):
    # pair every subdirectory of path with the set of file names it contains
    items = [(d, set(os.listdir(os.path.join(path, d))))
             for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    if len(items) < 2:
        # need at least two directories
        return ("", "", 0)
    # count shared names for every pair of directories and keep the largest
    values = [(items[i][0], items[j][0], len(items[i][1].intersection(items[j][1])))
              for i in range(len(items)) for j in range(i + 1, len(items))]
    return max(values, key=lambda a: a[2])

def main():
    path = sys.argv[1] if len(sys.argv) == 2 else os.getcwd()
    r = get_max_duplicates(path)
    print("%s and %s share %d files" % r)

if __name__ == '__main__':
    main()
As mentioned by Richard, by using a hash table (or a set in Python) we can speed things up. Intersecting two sets is O(min(len(set_a), len(set_b))), and we have to do N(N-1)/2 = 276 comparisons for N = 24 directories.
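The test data generator isn't shown here; a rough sketch of how such a tree might be created (the root name testdirs and the .fa extension are assumptions for illustration) could be:

#!/usr/bin/env python
# Hypothetical benchmark setup: 24 directories, each holding ~3900 empty
# files whose names are distinct random numbers between 0 and 9999.
import os, random

root = "testdirs"   # assumed root directory
for i in range(1, 25):
    d = os.path.join(root, "dir%d" % i)
    os.makedirs(d)
    for n in random.sample(range(10000), 3900):
        open(os.path.join(d, "%d.fa" % n), "w").close()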
Upvotes: 0
Reputation: 546
Could we create a hash table for all of these 24 directories? If the file names are just numbers, the hash function will be very easy to design.
If we can use a hash table, it will be faster to search for and find duplicates.
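This suggestion comes without code; one way to sketch the hash-table idea in Python (a dict mapping each file name to the directories that contain it, then tallying directory pairs) might be:

import os
from collections import Counter
from itertools import combinations

def most_shared_pair(root, dirs):
    # hash table: file name -> list of directories containing that name
    by_name = {}
    for d in dirs:
        for name in os.listdir(os.path.join(root, d)):
            by_name.setdefault(name, []).append(d)
    # a name held by k directories contributes to k*(k-1)/2 directory pairs
    pair_counts = Counter()
    for holders in by_name.values():
        for a, b in combinations(sorted(holders), 2):
            pair_counts[(a, b)] += 1
    return pair_counts.most_common(1)

# e.g. most_shared_pair("testdirs", ["dir1", "dir2", "dir3", "dir4"])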
Upvotes: 0
Reputation: 13707
./count_dups.sh:
1 files are duplicated comparing dir1 to dir2.
3 files are duplicated comparing dir1 to dir3.
4 files are duplicated comparing dir1 to dir4.
1 files are duplicated comparing dir2 to dir3.
2 files are duplicated comparing dir2 to dir4.
3 files are duplicated comparing dir3 to dir4.
./count_dups.sh | sort -n | tail -1
4 files are duplicated comparing dir1 to dir4.
Using the script count_dups.sh:
#!/bin/bash
# This assumes (among other things) that the dirs don't have spaces in the names
cd testdirs
declare -a DIRS=(`ls`);

function count_dups {
    DUPS=`ls $1 $2 | sort | uniq -d | wc -l`
    echo "$DUPS files are duplicated comparing $1 to $2."
}

LEFT=0
while [ $LEFT -lt ${#DIRS[@]} ] ; do
    RIGHT=$(( $LEFT + 1 ))
    while [ $RIGHT -lt ${#DIRS[@]} ] ; do
        count_dups ${DIRS[$LEFT]} ${DIRS[$RIGHT]}
        RIGHT=$(( $RIGHT + 1 ))
    done
    LEFT=$(( $LEFT + 1 ))
done
Upvotes: 0
Reputation: 74098
Count duplicate file names in shell:
#! /bin/sh
# directories to test for
dirs='dir1 dir2 dir3 dir4'

# directory pairs already seen
seen=''

for d1 in $dirs; do
    for d2 in $dirs; do
        # quote $seen so the leading space of the first recorded pair is kept
        if echo "$seen" | grep -q -e " $d1:$d2;" -e " $d2:$d1;"; then
            : # don't count twice
        elif test $d1 != $d2; then
            # remember pair of directories
            seen="$seen $d1:$d2;"
            # count duplicates
            ndups=`ls $d1 $d2 | sort | uniq -c | awk '$1 > 1' | wc -l`
            echo "$d1:$d2 $ndups"
        fi
    done
# sort decreasing and take the first
done | sort -k 2rn | head -1
Upvotes: 0
Reputation: 47367
I tagged your question with algorithm, as I am unaware of any existing bash / Linux tools that can directly solve this problem. The easiest way would be to construct an algorithm for this in a programming language such as Python, C++, or Java rather than in a shell script.
That being said, here's a high-level analysis of your problem: at first glance it looks like a minimum set cover problem, but it actually breaks down into two parts:
Part 1 - What is the set of files to cover?
You want to find the combination of directories that covers the largest number of duplicate files. But first you need to know what the maximum set of duplicate files within your 24 directories is.
Since the intersection of the files in two directories is always greater than or equal to their intersection with a third directory, you go through all pairs of directories and find the maximum intersection set:
(24 choose 2) = 276 comparisons
You take the largest intersection set found and use that as the set you are actually trying to cover.
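A short sketch of Part 1, assuming the listings have already been loaded as Python sets of file names (the helper name and its argument are illustrative):

from itertools import combinations

def largest_pairwise_intersection(name_sets):
    # name_sets: dict mapping directory name -> set of file names
    best = set()
    for a, b in combinations(name_sets, 2):
        common = name_sets[a] & name_sets[b]
        if len(common) > len(best):
            best = common
    return best  # the duplicate set you then try to cover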
Part 2 - The minimum set cover problem
This is a well-studied problem in computer science, so you are better served by reading the writings of people much smarter than I am.
The only thing I will note is that it is an NP-complete problem, so it's not trivial.
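For illustration only, the standard greedy approximation (repeatedly pick the directory that covers the most still-uncovered names) looks roughly like this; it yields an approximate cover, not a guaranteed minimum:

def greedy_set_cover(target, name_sets):
    # target: set of file names to cover; name_sets: dir name -> set of names
    # classic greedy approximation, not an exact minimum cover
    uncovered = set(target)
    chosen = []
    while uncovered:
        best = max(name_sets, key=lambda d: len(name_sets[d] & uncovered))
        gained = name_sets[best] & uncovered
        if not gained:
            break  # remaining names appear in no directory
        chosen.append(best)
        uncovered -= gained
    return chosen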
This is the best I can do to address the original formulation of your question, but I have a feeling that it's overkill for what you actually need to accomplish. You should consider updating your question with the actual problem that you need to solve.
Upvotes: 4