Pairwise distance calculation (multidimensional matrix) for feature similarity

Question

OK, here is the formula in MATLAB:

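function D = dumDistance(X,Y)
% Squared Euclidean distance between every column of X and every column of Y.
n1 = size(X,2);   % number of columns (observations) in X
n2 = size(Y,2);   % number of columns (observations) in Y
D = zeros(n1,n2); % D(i,j) will hold the distance between X(:,i) and Y(:,j)
for i = 1:n1
    for j = 1:n2
        % subtract component-wise, square, and sum over the rows
        D(i,j) = sum((X(:,i)-Y(:,j)).^2);
    end
end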

Credits here (I know it's not a fast implementation, but it shows the basic algorithm).

Now, here is what I don't understand:

Say we have a matrix dictionary of size 140x100 (100 words) and a matrix page of size 140x40 (40 words). Each column represents a word in the 140-dimensional space.

Now, if I use dumDistance(page,dictionary), it will return a 40x100 matrix with the distances.

What I want to achieve is to find how close each word of the page matrix is to the words of the dictionary matrix, in order to represent the page in terms of the dictionary, with a histogram let's say.

I know that if I take the min of the 40x100 matrix, I'll get a 1x100 matrix with the locations of the min values to represent my histogram.

What I really can't understand here is this 40x100 matrix. What data does this matrix represent? I can't visualize it in my mind.

rayryeng · Accepted Answer

Minor comment before I start:

You should really use pdist2 instead. It is much faster and gives you the same distances as dumDistance, except that it returns the Euclidean distance rather than its square (see the note at the end). You would call it like this:

D = pdist2(page.', dictionary.');

You need to transpose page and dictionary because pdist2 assumes that each row is an observation and each column is a variable / feature, whereas your data stores one observation per column. This returns a 40 x 100 matrix, just like dumDistance, but without the explicit double for loop in your code.
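If pdist2 is not available (it ships with the Statistics and Machine Learning Toolbox), the same loop-free idea can be sketched in base MATLAB. This is only an illustration: the random page and dictionary matrices below are made-up stand-ins for your data, and the expansion of the squared norm is a swapped-in technique, not necessarily how pdist2 works internally.

```matlab
% Hypothetical data with the sizes from the question: one word per column.
rng(0);                          % fixed seed so the run is repeatable
page = rand(140, 40);            % 40 page words
dictionary = rand(140, 100);     % 100 dictionary words

% Squared norm of every column: 40 x 1 and 1 x 100.
pp = sum(page.^2, 1).';
dd = sum(dictionary.^2, 1);

% ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y, evaluated for all pairs at once.
% bsxfun expands the 40 x 1 and 1 x 100 vectors to a 40 x 100 matrix.
D = bsxfun(@plus, pp, dd) - 2 * (page.' * dictionary);
D = max(D, 0);                   % clip tiny negative values caused by round-off

% D is 40 x 100 and holds squared Euclidean distances, like dumDistance.
```

The matrix product does the heavy lifting here, which is usually why vectorized versions outrun the double loop.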

Now onto your question:

D(i,j) represents the squared Euclidean distance between word i from your page and word j from your dictionary. You have 40 words on your page and 100 words in your dictionary, and each word is represented by a 140-dimensional feature vector. The rows of D therefore index the words of page, while the columns of D index the words of dictionary.
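To make the row / column reading concrete, here is a small sketch of how such a matrix could feed the histogram you describe. The toy 2-dimensional words and the variable names nearest and counts are made up for illustration; treat it as one possible approach rather than the definitive one.

```matlab
% Hypothetical toy data: 2-dimensional "words", one per column.
dictionary = [0 10 20; 0 0 0];        % 3 dictionary words
page       = [1 9 11 19; 0 0 0 0];    % 4 page words

% Squared distances: rows index page words, columns index dictionary words.
nPage = size(page, 2);
nDict = size(dictionary, 2);
D = zeros(nPage, nDict);
for i = 1:nPage
    for j = 1:nDict
        D(i, j) = sum((page(:, i) - dictionary(:, j)).^2);
    end
end

% For each page word (row), find the column with the smallest distance.
[~, nearest] = min(D, [], 2);         % nearest is [1; 2; 2; 3]

% Count how many page words were assigned to each dictionary word.
counts = accumarray(nearest, 1, [nDict 1]).';   % counts is [1 2 1]
```

Note that min(D, [], 2) works along each row, giving one dictionary index per page word. Calling min(D) instead works down each column and gives one value per dictionary word, which answers a different question (which page word is closest to each dictionary word).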

What I mean by "distance" here is distance in the feature space. Each word from your page and from your dictionary is represented as a vector of length 140. To fill entry (i,j) of D, you take the i-th vector from page and the j-th vector from dictionary, subtract their corresponding components, square each difference, and sum the results. The sum is stored in D(i,j) and measures the dissimilarity between word i from your page and word j from your dictionary. The higher the value, the more dissimilar the two words are.
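A tiny worked example of a single entry may help. The two 3-dimensional vectors below are made up and simply stand in for one column of page and one column of dictionary:

```matlab
x = [1; 2; 3];      % stand-in for page(:, i)
y = [4; 0; 3];      % stand-in for dictionary(:, j)

diffs = x - y;      % component-wise difference: [-3; 2; 0]
sq    = diffs.^2;   % squared differences:       [ 9; 4; 0]
d     = sum(sq);    % the value stored in D(i,j): 13
```

With 140-dimensional words the computation is identical, just with 140 terms in the sum instead of 3.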

Minor Note: pdist2 computes the Euclidean distance, while dumDistance computes the squared Euclidean distance. If you want the same values as dumDistance, square every element of the D returned by pdist2. In other words, compute D.^2.
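If you want to convince yourself of this, a quick check along the following lines should do. It assumes the Statistics and Machine Learning Toolbox is installed (for pdist2), uses random stand-in data, and inlines the double loop from the question so that it runs as a single script. The 'squaredeuclidean' option is an alternative to squaring by hand.

```matlab
% Requires the Statistics and Machine Learning Toolbox (for pdist2).
rng(1);                               % fixed seed so the run is repeatable
page = rand(140, 40);                 % hypothetical page words
dictionary = rand(140, 100);          % hypothetical dictionary words

% Reference: the double loop from the question, written inline.
Dloop = zeros(40, 100);
for i = 1:40
    for j = 1:100
        Dloop(i, j) = sum((page(:, i) - dictionary(:, j)).^2);
    end
end

% Option 1: square the Euclidean distances returned by pdist2.
Dsq = pdist2(page.', dictionary.').^2;

% Option 2: ask pdist2 for squared distances directly.
Dsq2 = pdist2(page.', dictionary.', 'squaredeuclidean');

% Both differences should be at round-off level.
maxErr1 = max(abs(Dloop(:) - Dsq(:)));
maxErr2 = max(abs(Dloop(:) - Dsq2(:)));
```

Since squaring preserves the ordering of non-negative values, the nearest dictionary word for each page word is the same whichever of the two distances you use.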

Hope this helps. Good luck!
