Reputation: 21
I have two G.729-encoded files, and I am working with the PCM versions of them. I want to measure the similarity between these two files. They are binary files, so how can one measure the similarity between binary files? I wrote a program in C that takes patterns from the first file and searches for similar ones in the second, but I want an actual similarity measure. I searched the literature and found Jaccard and other measures, but I still can't determine which of them suits my case. Thanks in advance for your help.
Upvotes: 2
Views: 2059
Reputation: 669
I had the same need and came up with a solution that works in my case, though I cannot guarantee it is universal:
I used a library that creates diff files. Given fileA and fileB, the library creates a third file, fileDiff, that describes how to get from fileA to fileB: which bytes to copy from fileA and which new bytes to add. (For more information about the format: http://www.w3.org/TR/NOTE-gdiff-19970901.html )
From those two counts a function gives me a percentage. I know this is not fully accurate: for example, if fileB is equal to half of fileA, every byte of fileB can be copied from fileA, so the function reports a similarity of 100%.
This is the DiffWriter implementation:
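import java.io.IOException;
import com.nothome.delta.DiffWriter;

// Counts how many bytes the diff copies from fileA and how many it adds as new data.
public class Distance implements DiffWriter {
    private long newData = 0;     // bytes of fileB that had to be added
    private long copiedData = 0;  // bytes of fileB that were copied from fileA

    @Override
    public void flush() throws IOException {}

    @Override
    public void close() throws IOException {}

    @Override
    public void addData(byte arg0) throws IOException {
        newData++;
    }

    @Override
    public void addCopy(long arg0, int arg1) throws IOException {
        copiedData += arg1;  // arg1 is used as the length of the copied run
    }

    // Percentage of the diff output that was copied rather than added.
    public double getSimilarity() {
        double a = (double) newData;
        double c = (double) copiedData;
        if (c + a == 0) {
            return 0.0;  // nothing was written (e.g. empty fileB); avoids NaN from 0/0
        }
        return (( c / (c + a) ) * 100.0);
    }
}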
Here is how I call it:
import java.io.File;
import com.nothome.delta.Delta;

File f1 = new File(...);
File f2 = new File(...);
Distance dw = new Distance();
try {
    new Delta().compute(f1, f2, dw);
    // Use the result; the original call discarded it.
    System.out.println(dw.getSimilarity());
} catch (Exception e) {
    e.printStackTrace();
}
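The caveat above (a fileB that is half of fileA still scores 100%) comes from the measure being one-directional. A symmetric measure such as the Jaccard index mentioned in the question may behave better there. The following is only a minimal sketch, not part of the diff library: the class name ByteShingleJaccard, the choice of comparing runs of n consecutive bytes, and the value of n are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: Jaccard index over the sets of n-byte runs ("shingles")
// found in two byte arrays. Not part of com.nothome.delta.
final class ByteShingleJaccard {

    // Collects every run of n consecutive bytes as a set element.
    // ISO-8859-1 maps each byte to exactly one char, so distinct runs
    // produce distinct strings.
    static Set<String> shingles(byte[] data, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= data.length; i++) {
            out.add(new String(data, i, n, StandardCharsets.ISO_8859_1));
        }
        return out;
    }

    // |A intersect B| / |A union B|, in the range [0, 1].
    static double jaccard(byte[] a, byte[] b, int n) {
        Set<String> sa = shingles(a, n);
        Set<String> sb = shingles(b, n);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0;  // assumption: two inputs with no shingles count as identical
        }
        Set<String> common = new HashSet<>(sa);
        common.retainAll(sb);
        int union = sa.size() + sb.size() - common.size();
        return (double) common.size() / union;
    }
}
```

To use it on files, read both with java.nio.file.Files.readAllBytes and pass the arrays to jaccard. Because the union includes everything that appears in only one file, a fileB that is a prefix of fileA will typically score below 1 here, unlike the copy percentage. The sets ignore how often and where a run occurs, so this is a coarse measure.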
Upvotes: 0
Reputation: 70701
Since you mention the files are audio files, it would be better to define a similarity measure based on audio characteristics rather than simply doing a binary comparison. A quick search brought up a research project called MusicMiner that you may want to look into for further ideas.
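As one very simple illustration of comparing audio characteristics instead of raw bytes, the sketch below compares the loudness contours of two signals. It is an assumption-laden example, not something taken from MusicMiner: it assumes the PCM has already been loaded as 16-bit samples, the class name EnvelopeSimilarity and the frame length are hypothetical, and per-frame energy is only one of many possible features.

```java
// Hypothetical helper: cosine similarity between per-frame energy envelopes
// of two 16-bit PCM signals.
final class EnvelopeSimilarity {

    // Mean squared amplitude of each full frame of frameLen samples.
    // A trailing partial frame is dropped.
    static double[] frameEnergies(short[] pcm, int frameLen) {
        int frames = pcm.length / frameLen;
        double[] energies = new double[frames];
        for (int f = 0; f < frames; f++) {
            double sum = 0.0;
            for (int i = 0; i < frameLen; i++) {
                double s = pcm[f * frameLen + i];
                sum += s * s;
            }
            energies[f] = sum / frameLen;
        }
        return energies;
    }

    // Cosine similarity over the overlapping prefix of the two vectors.
    // Returns 0.0 if either prefix has zero norm (e.g. pure silence).
    static double cosine(double[] x, double[] y) {
        int n = Math.min(x.length, y.length);
        double dot = 0.0, nx = 0.0, ny = 0.0;
        for (int i = 0; i < n; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        if (nx == 0.0 || ny == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(nx * ny);
    }
}
```

A call would look like cosine(frameEnergies(pcmA, 80), frameEnergies(pcmB, 80)), where 80 samples is 10 ms if the PCM is sampled at 8 kHz. Unlike a byte comparison, this tolerates small sample-level differences, but it assumes the two recordings are time-aligned and it ignores any length difference beyond the shorter signal.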
Upvotes: 2