Reputation: 21
Question:
The following are the keywords, their frequencies, and the token counts from 3 other documents.
Doc 4 – tablet: 7; memory: 5; apps: 8; sluggish: 5
Doc 5 – memory: 4; performance: 6; playbook: 8; apps: 6
Doc 6 – tablet: 6; performance: 3; playbook: 7; sluggish: 3
Token counts: Doc 4: 55; Doc 5: 60; Doc 6: 65
(i) Use Euclidean Distance to calculate similarity values for the three pairs of documents (4,5), (4,6), (5,6) with relative frequency values. State the distance for each pair to 4 decimal places (4 d.p.).
I have tried to use the Euclidean Distance formula with the given pairs of documents to find the distance for each pair.
This is the equation that I have tried to use:
According to the solutions this is what the answer should be:
Euclidean D4,D5 = 0.2343 (4 d.p.)
Euclidean D5,D6 = 0.1693 (4 d.p.)
Euclidean D4,D6 = 0.2153 (4 d.p.)
Any help would be appreciated.
Upvotes: 2
Views: 4771
Reputation: 696
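First you should build your document-term matrix from the relative term frequencies. The relative frequency of a term is the number of times the term occurs in a document divided by the number of tokens in that document. A term that does not occur in a document gets 0, which gives:
Doc 4 – tablet: 7/55; memory: 5/55; apps: 8/55; sluggish: 5/55; performance: 0; playbook: 0
Doc 5 – tablet: 0; memory: 4/60; apps: 6/60; sluggish: 0; performance: 6/60; playbook: 8/60
Doc 6 – tablet: 6/65; memory: 0; apps: 0; sluggish: 3/65; performance: 3/65; playbook: 7/65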
As you mentioned the distance formula yourself I will just calculate the distance between document 4 and 5 as an example.
d(Document4, Document5) = [(7/55-0)^2 + (5/55-4/60)^2 + (8/55-6/60)^2 + (5/55-0)^2 + (0-6/60)^2 + (0-8/60)^2]^(1/2) ≈ 0.234296, which rounds to 0.2343.
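The same calculation can be sketched in Python for all three pairs. This is only a minimal sketch: the names `counts`, `tokens`, `relative_freq`, and `euclidean` are my own, and it assumes that a term missing from a document has frequency 0.

```python
from math import sqrt

# Raw keyword counts and token counts, copied from the question.
counts = {
    4: {"tablet": 7, "memory": 5, "apps": 8, "sluggish": 5},
    5: {"memory": 4, "performance": 6, "playbook": 8, "apps": 6},
    6: {"tablet": 6, "performance": 3, "playbook": 7, "sluggish": 3},
}
tokens = {4: 55, 5: 60, 6: 65}


def relative_freq(doc):
    """Divide each keyword count by the document's token count."""
    return {term: n / tokens[doc] for term, n in counts[doc].items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors (missing term = 0)."""
    terms = set(u) | set(v)
    return sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


for a, b in [(4, 5), (4, 6), (5, 6)]:
    d = euclidean(relative_freq(a), relative_freq(b))
    print(f"Euclidean D{a},D{b} = {d:.4f}")
```

Rounded to 4 d.p., this should print 0.2343, 0.2153, and 0.1693, the values quoted from the solutions in the question.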
Upvotes: 2
Reputation: 6564
The Euclidean distance between points p and q is the length of the line segment connecting them (pq).
In Cartesian coordinates, if p = (p1, p2,..., pn) and q = (q1, q2,..., qn) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p is given by the Pythagorean formula:
d(p, q) = d(q, p) = [(p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2]^(1/2)
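As a rough illustration of this formula, here is a small Python sketch for points of any dimension. The function name `euclidean_distance` is my own choice, and the 3-4-5 triangle is only there as a sanity check.

```python
from math import sqrt


def euclidean_distance(p, q):
    """Return [(p1-q1)^2 + ... + (pn-qn)^2]^(1/2) for equal-length points."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of coordinates")
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Sanity check: the 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```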
Let's first put the three documents on a common vocabulary, filling in 0 for any term a document does not contain.
Doc 4 – tablet: 7, memory: 5, apps: 8, sluggish: 5, playbook: 0, performance: 0
Doc 5 – tablet: 0, memory: 4, apps: 6, sluggish: 0, playbook: 8, performance: 6
Doc 6 – tablet: 6, memory: 0, apps: 0, sluggish: 3, playbook: 7, performance: 3
then according to above formula,
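D(Doc4, Doc5) = [(7-0)^2 + (5-4)^2 + (8-6)^2 + (5-0)^2 + (0-8)^2 + (0-6)^2]^(1/2) = [49+1+4+25+64+36]^(1/2) = 179^(1/2) ~= 13.38
Note that this is the distance on raw counts. For the relative-frequency values the question asks for, divide each count by its document's token count (55, 60, 65) before applying the formula.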
You can calculate the other two pairs as I've done.
Let me know if you need it, and I'll add a sample snippet to calculate this programmatically.
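A snippet along those lines might look like the sketch below. It assumes the six-term ordering (tablet, memory, apps, sluggish, playbook, performance) from the listing above, works on raw counts as this answer does, and relies on `math.dist`, which needs Python 3.8 or later. The names `vocab`, `docs`, and `to_vector` are illustrative only.

```python
from math import dist  # available from Python 3.8

# Fixed term order so every document maps to the same coordinates.
vocab = ["tablet", "memory", "apps", "sluggish", "playbook", "performance"]

docs = {
    "Doc 4": {"tablet": 7, "memory": 5, "apps": 8, "sluggish": 5},
    "Doc 5": {"memory": 4, "apps": 6, "playbook": 8, "performance": 6},
    "Doc 6": {"tablet": 6, "sluggish": 3, "playbook": 7, "performance": 3},
}


def to_vector(counts):
    """Expand a sparse count dict to a dense vector, using 0 for absent terms."""
    return [counts.get(term, 0) for term in vocab]


v4, v5 = to_vector(docs["Doc 4"]), to_vector(docs["Doc 5"])
print(v4)            # [7, 5, 8, 5, 0, 0]
print(v5)            # [0, 4, 6, 0, 8, 6]
print(dist(v4, v5))  # Euclidean distance on raw counts
```

To get the relative-frequency distances the question asks for, each count would first be divided by its document's token count (55, 60, 65).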
Upvotes: 0
Reputation: 907
The Euclidean distance is a popular distance measure, and the formula is as follows:
Suppose you have 2 points (a1,b1) and (a2,b2), then the Euclidean distance between these points is given as: SquareRoot( (a2-a1)^2 + (b2-b1)^2 ).
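In your case, both vectors must list the same terms in the same order, with 0 for a term a document does not contain, and each count divided by the document's token count. Using the order (tablet, memory, apps, sluggish, performance, playbook):
Doc 4 - (7/55, 5/55, 8/55, 5/55, 0, 0)
Doc 5 - (0, 4/60, 6/60, 0, 6/60, 8/60)
So the formula to apply would be,
SquareRoot( (a2-a1)^2 + (b2-b1)^2 + (c2-c1)^2 + (d2-d1)^2 + (e2-e1)^2 + (f2-f1)^2 ).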
Upvotes: 0