Calculate cosine similarity of document relevance

Question

I have got the normalized TF-IDF for the documents and also the keyword RDD, and now I want to compute the cosine similarity to find a relevance score for each document.

So I tried the following:

    import re
    from pyspark.mllib.feature import HashingTF, IDF, Normalizer
    from pyspark.mllib.linalg import Vectors

    # sc is the existing SparkContext
    documentRdd = sc.textFile("documents.txt").flatMap(lambda l: re.split(r'[^\w]+', l))
    keyWords = sc.textFile("keywords.txt").flatMap(lambda l: re.split(r'[^\w]+', l))
    normalizer1 = Normalizer()
    hashingTF = HashingTF()
    tf = hashingTF.transform(documentRdd)
    tf.cache()
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
    normalizedtfidf = normalizer1.transform(tfidf)

Now I want to calculate the cosine similarity between normalizedtfidf and keyWords, so I tried using

    x = Vectors.dense(normalizedtfidf)
    y = Vectors.dense(keywordTF)
    print(1 - x.dot(y)/(x.norm(2)*y.norm(2)), "is the relevance score")

But this throws the error

    TypeError: float() argument must be a string or a number

which suggests I am passing the wrong format. Any help is appreciated.

Update

I then tried

    x = Vectors.sparse(normalizedtfidf.count(),normalizedtfidf.collect())
    y = Vectors.sparse(keywordTF.count(),keywordTF.collect())

but got the error

    TypeError: Cannot treat type as a vector

cylim · Accepted Answer

You get these errors because you are passing a whole RDD to Vectors.dense and Vectors.sparse, which build a single local vector from numbers or arrays and cannot convert an RDD.

You can achieve what you need without that conversion, using the following steps.

First, join both RDDs into one RDD. Note that I am assuming neither RDD has a unique index to join on, so one is added by row position.

    # Add a row index to both RDDs and make it the key.
    rdd1 = normalizedtfidf.zipWithIndex().map(lambda arg: (arg[1], arg[0]))
    rdd2 = keywordTF.zipWithIndex().map(lambda arg: (arg[1], arg[0]))

    # Join both RDDs on the index.
    rdd_joined = rdd1.join(rdd2)
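
To make the pairing explicit, here is a small Spark-free sketch that emulates zipWithIndex followed by join on plain Python lists. The helper names (zip_with_index, index_first, inner_join) and the placeholder strings are hypothetical and only stand in for the RDD operations and vectors. The point it illustrates is that join is an inner join: if the two RDDs have different lengths, the unmatched rows are silently dropped.

```python
# Local, Spark-free emulation of zipWithIndex() followed by join().
# All helper names here are hypothetical and exist only for illustration.

def zip_with_index(rows):
    """Mimic RDD.zipWithIndex(): pair each element with its position."""
    return [(row, idx) for idx, row in enumerate(rows)]

def index_first(pairs):
    """Swap (row, idx) to (idx, row) so the index becomes the join key."""
    return [(idx, row) for row, idx in pairs]

def inner_join(left, right):
    """Mimic RDD.join() for unique keys: keep keys present on both sides."""
    right_by_key = dict(right)
    return [(k, (v, right_by_key[k])) for k, v in left if k in right_by_key]

# Placeholder strings stand in for the TF-IDF vectors.
docs = ["doc_vec_0", "doc_vec_1", "doc_vec_2"]
keys = ["kw_vec_0", "kw_vec_1"]

joined = inner_join(index_first(zip_with_index(docs)),
                    index_first(zip_with_index(keys)))
print(joined)
# [(0, ('doc_vec_0', 'kw_vec_0')), (1, ('doc_vec_1', 'kw_vec_1'))]
```

Each joined row has the shape (index, (x, y)), which is why the function in the next step reads row[1][0] and row[1][1]. In Spark itself the order of the joined rows is not guaranteed, so keep the index if you need to map scores back to documents.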

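Then map the joined RDD with a function that calculates the cosine distance, that is, 1 minus the cosine similarity. Drop the "1 -" if you want the similarity itself as the relevance score.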

    def cosine_dist(row):
        x = row[1][0]
        y = row[1][1]
        return (1 - x.dot(y)/(x.norm(2)*y.norm(2)))

    res = rdd_joined.map(cosine_dist)

You can then use your results or run collect to see them.
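
As a way to check the arithmetic outside Spark, the following is a minimal pure-Python sketch of the same formula. It assumes sparse vectors are represented as {index: value} dicts, which is not how Spark stores them, and the function names are made up for this example. Returning 0.0 for an all-zero vector is also an assumption; the function above would divide by zero in that case.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors given as {index: value} dicts."""
    dot = sum(v * y.get(i, 0.0) for i, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    """Cosine distance, i.e. 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(x, y)

doc = {0: 3.0, 2: 4.0}
same_direction = {0: 6.0, 2: 8.0}   # doc scaled by 2
orthogonal = {1: 5.0}               # shares no index with doc

print(cosine_similarity(doc, same_direction))  # 1.0
print(cosine_similarity(doc, orthogonal))      # 0.0
print(cosine_distance(doc, orthogonal))        # 1.0
```

A similarity near 1 means the document points in the same direction as the keyword vector, so it works directly as a relevance score, while the distance is smaller for more relevant documents.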
