kenlz

Reputation: 461

Compare thousands of files efficiently in Java

Say we have 2 folders each with 1000 files in it, and I need to check for similar words used in them.

A naive approach would be:

for (File f : folderA) {
    for (File g : folderB) {
        compare(f, g); // placeholder for the pairwise comparison: 1000 x 1000 calls
    }
}

But this does an unreasonable number of comparisons, which costs memory and time. Are there better ways to do this?

Upvotes: 0

Views: 93

Answers (3)

OPK

Reputation: 4180

Just use a map. Depending on what you are trying to compare, modify the code accordingly.

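import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Count how often each key occurs; replace File with whatever you compare (e.g. a word).
Map<File, Integer> map = new HashMap<>();
for (File f : folderA) {
    // merge() stores 1 for a new key, otherwise adds 1 to the existing count
    map.merge(f, 1, Integer::sum);
}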

You can then loop through the map and read the value of each entry. The value indicates how many times that item occurs in your collection.

To loop through the map:

    for (Map.Entry<File, Integer> entry : map.entrySet()) {
        // entry.getKey() is the item, entry.getValue() is its count
    }

This algorithm is O(n) in the number of items (expected time, with a hash map), which is pretty fast.

Upvotes: 1

JFPicard

Reputation: 5168

I would add that if you're checking for similar words rather than identical ones, I suggest computing the Double Metaphone encoding (see https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html) of all relevant words, after removing common words such as "the", "this" and so on.

Upvotes: 0

Vlad

Reputation: 18633

Depends on what you're trying to do.

You could create a Map from each File to the set of distinct words it contains, and then compare pairs of sets. Ideally, and assuming reasonable data, that will take much less time than re-reading every pair of files.
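
A minimal sketch of that first idea, under some assumptions: a "word" is taken to be a run of letters or digits, files are UTF-8 text, and Java 11+ is available (for Files.readString). The class and method names (WordSets, distinctWords, wordsPerFile, commonWords) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

class WordSets {

    // Distinct lower-cased words of a text. Assumption: a word is a
    // maximal run of letters or digits; adjust the regex to your data.
    static Set<String> distinctWords(String text) {
        Set<String> words = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split() can yield a leading empty token
                words.add(token);
            }
        }
        return words;
    }

    // Reads every regular file in the folder exactly once.
    // Assumes UTF-8 text files small enough to load into memory.
    static Map<Path, Set<String>> wordsPerFile(Path folder) throws IOException {
        Map<Path, Set<String>> result = new HashMap<>();
        try (Stream<Path> files = Files.list(folder)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(file)) {
                    result.put(file, distinctWords(Files.readString(file)));
                }
            }
        }
        return result;
    }

    // Words that two files share: the intersection of their word sets.
    static Set<String> commonWords(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common;
    }
}
```

You would call wordsPerFile once per folder and then run commonWords over the pairs you care about. That is still 1000 x 1000 set intersections, but each file is read and tokenized only once instead of 1000 times.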

Alternatively, you could have a Map of words to the files containing them. So then, for each word, you'd know if it appears in more than one file.
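
The second idea could be sketched roughly like this. To keep it short, the input is assumed to be a map from a file name to its already-loaded text; the names (WordIndex, index, sharedWords) and that input shape are illustrative assumptions, and the word-splitting rule is the same guess as any tokenizer (runs of letters or digits).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WordIndex {

    // Builds word -> names of the files that contain it.
    // `contents` maps a file name to its text (hypothetical input shape).
    static Map<String, Set<String>> index(Map<String, String> contents) {
        Map<String, Set<String>> filesByWord = new HashMap<>();
        for (Map.Entry<String, String> e : contents.entrySet()) {
            String[] tokens = e.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    // Create the file set on first sight of the word, then add this file.
                    filesByWord.computeIfAbsent(token, k -> new TreeSet<>()).add(e.getKey());
                }
            }
        }
        return filesByWord;
    }

    // Keeps only the words that appear in more than one file.
    static Map<String, Set<String>> sharedWords(Map<String, Set<String>> filesByWord) {
        Map<String, Set<String>> shared = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : filesByWord.entrySet()) {
            if (e.getValue().size() > 1) {
                shared.put(e.getKey(), e.getValue());
            }
        }
        return shared;
    }
}
```

This makes a single pass over all files and never compares files pairwise; the cost is roughly proportional to the total number of words. If you only want words shared between folderA and folderB (not within one folder), you could prefix the file names with the folder and filter on that.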

Upvotes: 4
