Reputation: 461
Say we have 2 folders, each with 1000 files in them, and I need to check for similar words used in them.
A naive approach would be
for (File f : folderA) {
    for (File g : folderB) {
        // compare f and g
    }
}
but this would do an unreasonably large number of comparisons, which takes memory and time. I wonder, are there better ways to do this?
Upvotes: 0
Views: 93
Reputation: 4180
Just use a map. Note: depending on what you are trying to compare, modify the code accordingly.
Map<File, Integer> map = new HashMap<>();
for (File f : folderA) {
    Integer count = map.get(f);
    if (count == null) {
        map.put(f, 1);
    } else {
        map.put(f, count + 1);
    }
}
You can loop through the map and read each entry's value, which indicates how many similar items are in your collection.
To loop through the map:
for (Map.Entry<File, Integer> entry : map.entrySet()) {
    // entry.getKey() is the item, entry.getValue() is its count
}
The time complexity of this algorithm is linear, O(n), so it is pretty fast.
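Since the question is about words rather than File objects, here is a sketch of the same counting pattern applied to String keys. The file reading and the whitespace/punctuation tokenization are assumptions for illustration; Map.merge does the get-or-insert step in one call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCounts {
    // Count how often each word appears across the given files.
    public static Map<String, Integer> count(List<Path> files) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (Path file : files) {
            // Split on any run of non-word characters (tokenization assumed).
            for (String word : Files.readString(file).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    // merge() replaces the get()/null-check/put() dance:
                    // insert 1 for a new word, otherwise add 1 to the old count.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

A word with a count greater than the number of times it appears in a single file must occur in more than one file.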
Upvotes: 1
Reputation: 5168
If I may add: if you're checking for similar words, not for identical ones, I suggest calculating the DoubleMetaphone (see https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html) of all relevant words, after removing articles like "the", "this", and so on.
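DoubleMetaphone lives in the commons-codec dependency. As a dependency-free illustration of the same idea (a phonetic encoding, so that similar-sounding words collide on the same key), here is a simplified classic Soundex sketch; it is a stand-in, not DoubleMetaphone itself, and it omits Soundex's h/w special case.

```java
// Simplified classic Soundex: encodes a word as its first letter plus three
// digits, so that similar-sounding words map to the same code.
public class SoundexDemo {
    // Map a letter to its Soundex digit; '0' means "skipped" (vowels, h, w, y).
    static char digit(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0';
        }
    }

    static String soundex(String word) {
        StringBuilder sb = new StringBuilder();
        sb.append(Character.toUpperCase(word.charAt(0)));
        char prev = digit(word.charAt(0));
        for (int i = 1; i < word.length() && sb.length() < 4; i++) {
            char d = digit(word.charAt(i));
            // Keep a digit only if it is not skipped and not a repeat.
            if (d != '0' && d != prev) sb.append(d);
            prev = d;
        }
        while (sb.length() < 4) sb.append('0'); // pad to 4 characters
        return sb.toString();
    }
}
```

With this encoding, "Robert" and "Rupert" both become "R163", so grouping words by their code finds similar-sounding pairs in one linear pass.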
Upvotes: 0
Reputation: 18633
Depends on what you're trying to do.
You could create a Map mapping each File to the set of distinct words it contains, and then compare pairs of sets. Ideally, assuming typical data, that will take much less time than comparing every pair of files word by word.
Alternatively, you could have a Map of words to the files containing them. Then, for each word, you'd know whether it appears in more than one file.
Upvotes: 4