Reddevil

Reputation: 125

cosine similarity document distance

I'm given two documents and asked to compute the frequency of occurrence of each word in the documents. For example, if the word "CAT" appears twice in doc1 and twice in doc2, then it appears 4 times in total, and I need to compute the frequency of its appearance.

Through some Google searching over the last three nights I found a great algorithm called cosine similarity. I now understand how it works.

But I don't know how to implement it in Java. How should I convert words to vectors?

Suppose that my input is "how much wood chuck chuck of woodchuck could chuck wood". How could I convert the words into an n-dimensional vector space? Do I make an array of the words first and then iterate through it with a count variable to see how many times each word occurs? But doesn't that mean we need at least n count variables?

Thank you so much for helping me figure this out.

Upvotes: 1

Views: 2017

Answers (3)

Dinesh Kumar

Reputation: 603

I was going through an awesome video series by MIT: Models of Computation, Document Distance. And I found this problem there.

So I have written Java code to find the distance between two documents, where a document is nothing but space-separated words.

import java.util.HashMap;
import java.util.Scanner;

public class document_distance {

    // print the word array built from a document
    public static void printDoc(String[] doc) {
        System.out.println("===== printing doc words =====");
        for (String word : doc) {
            System.out.print(word + " ");
        }
        System.out.println();
    }

    // print a dictionary's (key, value) pairs
    public static void printMap(HashMap<String, Integer> dict) {
        System.out.println("===== printing dictionary (key, value) =====");
        for (String key : dict.keySet()) {
            System.out.println(key + " -> " + dict.get(key));
        }
    }

    public static void main(String[] args) {

        Scanner sc = new Scanner(System.in);
        String[] doc1 = sc.nextLine().split(" ");
        String[] doc2 = sc.nextLine().split(" ");

        // print both documents to verify that they were read correctly
        printDoc(doc1);
        printDoc(doc2);

        // two dictionaries with words as keys and counts of those words as values
        HashMap<String, Integer> dict1 = new HashMap<>();
        HashMap<String, Integer> dict2 = new HashMap<>();

        // update counts for doc1 in both dictionaries
        for (String word : doc1) {
            dict1.merge(word, 1, Integer::sum); // increment, starting at 1 if absent
            dict2.putIfAbsent(word, 0);         // make sure dict2 has every key too
        }

        // update counts for doc2 in both dictionaries
        for (String word : doc2) {
            dict2.merge(word, 1, Integer::sum);
            dict1.putIfAbsent(word, 0);
        }

        // print dictionaries
        printMap(dict1);
        printMap(dict2);

        // iterate over the key set (the union of both vocabularies) rather than
        // over doc1 itself: iterating over doc1 would count repeated words more
        // than once and would miss words that appear only in doc2
        long dotProduct = 0;
        long doc1sq = 0;
        long doc2sq = 0;
        for (String word : dict1.keySet()) {
            dotProduct += (long) dict1.get(word) * dict2.get(word);
            doc1sq += (long) dict1.get(word) * dict1.get(word);
            doc2sq += (long) dict2.get(word) * dict2.get(word);
        }

        double similarity = dotProduct / Math.sqrt((double) doc1sq * doc2sq);
        System.out.print("similarity = " + similarity);
    }
}

Upvotes: 1

Abhay

Reputation: 768

Yes, that's right. You would need as many components as there are unique words in your two documents combined, if you want to account for the frequency of every word.

One simple way to do this in Java is to make a HashMap with String keys and Integer values. Just go through the words in the order they appear in the document, and add one to the corresponding entry in the HashMap. At the end you will have the counts as values against the words as keys. When adding one, make sure that if an entry does not exist yet, you initialize it to 1.

More details in pseudocode:

for word in doc1
    if (!vector1.has(word)) {vector1.put(word, 0);}
    if (!vector2.has(word)) {vector2.put(word, 0);}
    vector1.put(word, vector1.get(word) + 1);
done
same loop for doc2, with the last line changed to vector2

Now you have both the vectors with the same words as keys, and counts in respective documents. Then you can use any one to walk over the words:

dotp = 0; v1sq = 0; v2sq = 0
for word in vector1
    dotp = dotp + vector1.get(word) * vector2.get(word)
    v1sq = v1sq + vector1.get(word) * vector1.get(word)
    v2sq = v2sq + vector2.get(word) * vector2.get(word)
done
similarity = dotp / sqrt(v1sq * v2sq)

There you are! Just work out the Java part.
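One possible rendering of that Java part, as a sketch (the class and method names here are my own, not from the question):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Count word frequencies into `own`, and pad `other` with 0 for unseen
    // words so that both maps end up with the same key set
    static void count(String[] doc, Map<String, Integer> own, Map<String, Integer> other) {
        for (String word : doc) {
            own.merge(word, 1, Integer::sum); // add one, initializing to 1
            other.putIfAbsent(word, 0);       // ensure the key exists in the other map
        }
    }

    static double similarity(String[] doc1, String[] doc2) {
        Map<String, Integer> v1 = new HashMap<>();
        Map<String, Integer> v2 = new HashMap<>();
        count(doc1, v1, v2);
        count(doc2, v2, v1);

        long dotp = 0, v1sq = 0, v2sq = 0;
        for (String word : v1.keySet()) { // both maps share the same key set
            long c1 = v1.get(word), c2 = v2.get(word);
            dotp += c1 * c2;
            v1sq += c1 * c1;
            v2sq += c2 * c2;
        }
        return dotp / Math.sqrt((double) v1sq * v2sq);
    }

    public static void main(String[] args) {
        String[] a = "the cat sat".split(" ");
        String[] b = "the cat ran".split(" ");
        System.out.println(similarity(a, b)); // 2 / sqrt(3 * 3) ≈ 0.667
    }
}
```

Walking over `vector1`'s key set works only because both loops padded the other map with zeros, which is exactly what the padding lines in the pseudocode are for.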

Upvotes: 0

Bohemian

Reputation: 425348

Hold the results in a Map<String, Long>, and use String#split() to separate the input into words.

You only need one line of code, after you've read the text into a String:

Map<String, Long> frequencies = Arrays
    .stream(text.toLowerCase().split("[^a-z']+"))
    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
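A usage sketch of that one-liner, wrapped in a class of my own naming (note that Collectors.counting() produces Long values, so the map type is Map<String, Long>):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordFrequencies {

    // Build a word -> count map from raw text in a single stream pipeline:
    // lowercase the text, split on runs of non-letter characters, then
    // group equal words together and count each group
    static Map<String, Long> frequencies(String text) {
        return Arrays
            .stream(text.toLowerCase().split("[^a-z']+"))
            .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> freq = frequencies("The cat saw the other cat");
        System.out.println(freq.get("cat")); // 2
        System.out.println(freq.get("the")); // 2
    }
}
```

Because the split pattern strips punctuation and the text is lowercased first, "The" and "the" land in the same bucket, which is what you want before feeding the counts into a cosine-similarity computation.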

Upvotes: 1
