Dobs

Reputation: 21

Measuring the similarity between two binary files?

I have two G.729-encoded files, and I took the PCM version of each. I want to measure the similarity between these two files. Since they are binary files, how can one measure the similarity between them? I wrote some C code that takes patterns from the first file and searches for similar ones in the second, but what I want is a single similarity measure. I searched the literature and found Jaccard and other measures, but I still can't determine which of them suits my case. Thanks in advance for your help.
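For reference, the Jaccard option mentioned above can be sketched in Java roughly as follows. This is only a minimal sketch under assumptions that are not in the question: each file is treated as a set of overlapping n-byte windows, and the class name `ByteJaccard` and the window size `n` are illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

final class ByteJaccard {

    // Collect the distinct n-byte windows of the data.
    // ISO-8859-1 maps every byte to exactly one char, so each window
    // becomes a unique String key that can be stored in a HashSet.
    static Set<String> windows(byte[] data, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= data.length; i++) {
            out.add(new String(data, i, n, StandardCharsets.ISO_8859_1));
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in the range [0, 1].
    static double jaccard(byte[] a, byte[] b, int n) {
        Set<String> wa = windows(a, n);
        Set<String> wb = windows(b, n);
        if (wa.isEmpty() && wb.isEmpty()) {
            return 1.0; // convention chosen here: two empty inputs count as identical
        }
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        int union = wa.size() + wb.size() - inter.size();
        return (double) inter.size() / union;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = {1, 2, 3, 5};
        // Windows of a: (1,2) (2,3) (3,4); of b: (1,2) (2,3) (3,5).
        System.out.println(jaccard(a, b, 2)); // prints 0.5
    }
}
```

For real files the byte arrays could come from `java.nio.file.Files.readAllBytes`. Note that this measure ignores where in the file a window occurs and how often, so it is a rough indicator at best.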

Upvotes: 2

Views: 2059

Answers (2)

hurtledown

Reputation: 669

I had the same need and came up with a solution that works in my case, though I cannot guarantee it is universal:

I used a library that creates diff files. Given fileA and fileB, this library creates a third file, fileDiff, that describes how to get from fileA to fileB: which bytes to copy from fileA and which new bytes to add. (For more information about the format, see http://www.w3.org/TR/NOTE-gdiff-19970901.html)

From the copied and added byte counts, a function computes a percentage. I know this is not fully accurate: for example, if fileB is equal to half of fileA, the function still reports a similarity of 100%.

This is the DiffWriter implementation:

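import java.io.IOException;

import com.nothome.delta.DiffWriter;

// Tallies how many target bytes were copied from the source file
// and how many had to be added as new data.
public class Distance implements DiffWriter {

    private long newData = 0;
    private long copiedData = 0;

    @Override
    public void flush() throws IOException {}

    @Override
    public void close() throws IOException {}

    // Called once for each byte that does not come from the source file.
    @Override
    public void addData(byte b) throws IOException {
        newData++;
    }

    // Called for each block of `length` bytes copied from the source file.
    @Override
    public void addCopy(long offset, int length) throws IOException {
        copiedData += length;
    }

    // Percentage of the target made of copied bytes.
    // Returns NaN if no bytes were recorded at all.
    public double getSimilarity() {
        double a = (double) newData;
        double c = (double) copiedData;

        return (c / (c + a)) * 100.0;
    }

}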

Here is how I call it:

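import java.io.File;

import com.nothome.delta.Delta;

    File f1 = new File(...);
    File f2 = new File(...);

    Distance dw = new Distance();

    try {
        // Compute the delta from f1 to f2; dw tallies copied vs. new bytes.
        new Delta().compute(f1, f2, dw);

        double similarity = dw.getSimilarity();
        System.out.println(similarity);

    } catch (Exception e) {
        e.printStackTrace();
    }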
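The 100% result for a truncated fileB comes from the measure being one-directional: it only asks how much of fileB can be built from fileA. One possible way around this, sketched below, is to compute the percentage in both directions (calling `compute` a second time with f1 and f2 swapped and a fresh `Distance`) and keep the smaller value. The class `SymmetricSimilarity` and its method names are hypothetical, and the byte counts in `main` are idealized numbers, assuming the diff finds the whole shared block.

```java
final class SymmetricSimilarity {

    // Same formula as getSimilarity(): copied / (copied + added) * 100.
    // Two empty files are treated as identical here (an assumption).
    static double oneWay(long copied, long added) {
        long total = copied + added;
        return total == 0 ? 100.0 : (copied * 100.0) / total;
    }

    // Keep the lower of the two directional scores, so a file that is
    // merely a fragment of the other no longer scores 100%.
    static double symmetric(double aToB, double bToA) {
        return Math.min(aToB, bToA);
    }

    public static void main(String[] args) {
        // Idealized case: fileA has 1000 bytes, fileB is its first 500 bytes.
        double aToB = oneWay(500, 0);   // all of fileB is copied from fileA
        double bToA = oneWay(500, 500); // half of fileA is copied, half is new
        System.out.println(symmetric(aToB, bToA)); // prints 50.0
    }
}
```

Taking the minimum is only one choice; averaging the two directions would be another, with a softer penalty for truncation.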

Upvotes: 0

casablanca

Reputation: 70701

Since you mention the files are audio files, it would be better to define a similarity measure based on audio characteristics rather than simply doing a binary comparison. A quick search brought up a research project called MusicMiner that you may want to look into for further ideas.
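To give a rough idea of what an audio-based measure could look like, here is a minimal Java sketch that compares the decoded samples directly, using normalized correlation at zero lag (the cosine similarity of the two sample vectors). This is not what MusicMiner does; it is just one simple illustration. It assumes raw 16-bit little-endian mono PCM with no header, and the class name `PcmCorrelation` is made up for the example.

```java
final class PcmCorrelation {

    // Decode raw 16-bit little-endian PCM bytes into samples.
    // The sample format is an assumption; a trailing odd byte is ignored.
    static short[] decode(byte[] raw) {
        short[] samples = new short[raw.length / 2];
        for (int i = 0; i < samples.length; i++) {
            int lo = raw[2 * i] & 0xff;  // low byte, unsigned
            int hi = raw[2 * i + 1];     // high byte, keeps the sign
            samples[i] = (short) ((hi << 8) | lo);
        }
        return samples;
    }

    // Normalized correlation at zero lag over the overlapping part.
    // Result is in [-1, 1]; returns 0 if either signal is all zeros.
    static double correlation(short[] x, short[] y) {
        int n = Math.min(x.length, y.length);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (double) x[i] * y[i];
            sxx += (double) x[i] * x[i];
            syy += (double) y[i] * y[i];
        }
        if (sxx == 0.0 || syy == 0.0) {
            return 0.0;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        short[] x = {1, 2, 3};
        short[] y = {2, 4, 6}; // same shape as x, twice as loud
        System.out.println(correlation(x, y));
    }
}
```

Unlike a byte-level comparison, this score is unaffected by a uniform change in volume. It does assume the two recordings are already time-aligned; if one is shifted relative to the other, the correlation would need to be evaluated over a range of lags.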

Upvotes: 2
