Reputation: 3
OK, here is the formula in MATLAB:
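function D = dumDistance(X,Y)
% Squared Euclidean distance between every column of X and every column of Y.
n1 = size(X,2);   % number of columns (observations) in X
n2 = size(Y,2);   % number of columns (observations) in Y
D = zeros(n1,n2); % D(i,j) pairs column i of X with column j of Y
for i = 1:n1
    for j = 1:n2
        D(i,j) = sum((X(:,i)-Y(:,j)).^2);
    end
end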
Credits here (I know it's not a fast implementation, but it is enough to show the basic algorithm).
Now here is my understanding problem.
Say that we have a matrix dictionary of size 140x100 and a matrix page of size 140x40. Each column represents a word in the 140-dimensional space.
Now, if I call dumDistance(page,dictionary), it returns a 40x100 matrix of distances.
What I want to achieve is to find how close each word of the page matrix is to the dictionary matrix, in order to represent the page according to the dictionary, with a histogram, let's say.
I know that if I take the min of the 40x100 matrix, I'll get a 1x100 matrix with the locations of the min values to represent my histogram.
What I really can't understand here is this 40x100 matrix. What data does this matrix represent, anyway? I can't visualize it in my mind.
Upvotes: 0
Views: 626
Reputation: 104565
Minor comment before I start:
You should really use pdist2 instead. It is much faster and gives essentially the same results as dumDistance (up to a squaring, as noted at the end). In other words, you would call it like this:
D = pdist2(page.', dictionary.');
You need to transpose page and dictionary because pdist2 assumes that each row is an observation and each column a variable / feature, whereas your data is structured so that each column is an observation. This returns a 40 x 100 matrix like the one you get from dumDistance. However, pdist2 does not use for loops.
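If the toolbox that provides pdist2 is not available, the loops can also be removed by hand. The following is only a sketch on made-up toy data (the small X and Y below are hypothetical, not your page and dictionary), and it is not meant to describe how pdist2 works internally. It relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y and produces squared distances, like dumDistance:

```matlab
% Hypothetical toy data: columns are observations, as in the question.
X = [0 1; 0 1; 0 1];          % 3 x 2, two "page" words
Y = [1 0 2; 1 0 2; 1 0 2];    % 3 x 3, three "dictionary" words

% Squared norms of every column.
xx = sum(X.^2, 1).';          % 2 x 1
yy = sum(Y.^2, 1);            % 1 x 3

% ||x||^2 + ||y||^2 - 2*x'*y for all column pairs at once.
D = bsxfun(@plus, xx, yy) - 2 * (X.' * Y);   % 2 x 3

% Round-off can produce tiny negative values on real data; clamp them.
D = max(D, 0);
```

On real-valued data the result can differ from the loop version by floating-point round-off, which is why the last line clamps at zero.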
Now onto your question:
D(i,j) represents the squared Euclidean distance between word i from your page and word j from your dictionary. You have 40 words on your page and 100 words in your dictionary. Each word is represented by a 140-dimensional feature vector, so the rows of D index the words of page while the columns of D index the words of dictionary.
By "distance" I mean distance in the feature space. Each word from your page and your dictionary is represented as a vector of length 140. To fill entry (i,j) of D, the ith vector from page and the jth vector from dictionary are subtracted component by component, the differences are squared, and the squares are summed. The result is stored in D(i,j) and measures the dissimilarity between word i from your page and word j from your dictionary. The higher the value, the more dissimilar the two words are.
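To make the layout of D concrete, here is a small sketch with made-up 2-dimensional data (three page words, two dictionary words; the values are hypothetical). It also shows one possible way to get the histogram you describe, assuming you want to assign each page word to its closest dictionary word. That assumption means taking the minimum along each row of D (one result per page word), not along each column:

```matlab
% Hypothetical toy data: each column is one word.
page       = [0 1 5; 0 1 5];   % 2 x 3, three page words
dictionary = [0 4; 0 4];       % 2 x 2, two dictionary words

nPage = size(page, 2);
nDict = size(dictionary, 2);

% Same computation as dumDistance: rows follow page, columns follow dictionary.
D = zeros(nPage, nDict);
for i = 1:nPage
    for j = 1:nDict
        D(i,j) = sum((page(:,i) - dictionary(:,j)).^2);
    end
end
% D = [ 0 32
%       2 18
%      50  2 ]

% For each page word (row), index of the closest dictionary word.
[~, nearest] = min(D, [], 2);                 % 3 x 1

% Count how many page words fall on each dictionary word.
counts = accumarray(nearest, 1, [nDict 1]);   % nDict x 1 histogram
```

Here the first two page words are closest to dictionary word 1 and the third to dictionary word 2. With your data, D would be 40 x 100, nearest would be 40 x 1, and counts would be 100 x 1.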
Minor Note: pdist2 computes the Euclidean distance while dumDistance computes the squared Euclidean distance. If you want the same thing as dumDistance, simply square every element of the D returned by pdist2. In other words, compute D.^2.
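As a quick check of that relationship on a single pair of vectors (hypothetical values, and using norm rather than pdist2 so that no toolbox is needed):

```matlab
% Two hypothetical feature vectors.
x = [1; 2; 3];
y = [4; 6; 3];

dSquared = sum((x - y).^2);   % what dumDistance stores: 9 + 16 + 0 = 25
dEuclid  = norm(x - y);       % plain Euclidean distance, the square root of dSquared
```

Squaring dEuclid recovers dSquared, up to floating-point round-off.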
Hope this helps. Good luck!
Upvotes: 1