Reputation: 20123
I am taking the following approach to computing similarity between multiple text file documents:
So far, I have implemented the first two steps. I am using Google Guava's MultiMap to associate hashed chunks with file names. As for step 3, I am now looking to do the following:
For each file {
get the file's hashed chunks
compare the hashed chunks with those of every other file
}
The idea here is to compare file signatures for common entries and report only clusters of those files whose signature intersection is above a given similarity threshold. But I ultimately want to tweak this logic to get a similarity score for each file as it is compared to a source file.
My question: What is the best way to compare hashed chunks for step 3? Also, how can I incorporate the concept of similarity with the former? Here is my code thus far:
public class ContentBasedChunkingMain implements Similarity {
Multimap<String, byte[]> hashMap = ArrayListMultimap.create();
public ContentBasedChunkingMain() {
}
@Override
public void getSimilarity(String dir) {
// Document directory
File directory = new File(dir);
Collection<File> collection = FileUtils.listFiles(directory,
TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE);
ArrayList<File> files = new ArrayList<File>(Arrays.asList(collection
.toArray(new File[0])));
// For each file in the directory
for (File f : files) {
// Get chunks
ArrayList<String> chunks = Chunker.getChunks(f);
for (String s : sentences) {
// MD5 Hash each sentence
MD5HashFunction h = new MD5HashFunction();
System.out.println(h.byteToHex(h.hash(s)));
// Store filename and hashed chunk into MultiMap
hashMap.put(f.getAbsolutePath(), h.hash(s));
}
}
// Step 3
for (File f : files) {
// for each file, get the file's hashed chunks
Collection<byte[]> bytes = hashMap.get(f.getAbsolutePath());
// compare ...
}
}
The algorithm that I am following, Content-Based Chunking Algorithm, is described in more detail in the following papers:
http://www.hpl.hp.com/techreports/2009/HPL-2009-90.pdf
http://webglimpse.net/pubs/TR93-33.pdf
http://www.hpl.hp.com/techreports/2005/HPL-2005-42R1.pdf
Upvotes: 0
Views: 803
Reputation: 20123
For the comparison, I took the intersection of the two hash chunk collections.
Upvotes: 1