Ricky

Reputation: 2750

Calculate cosine similarity of document relevance

I have got the normalized TF-IDF for the documents and also the keyword RDD, and now I want to compute the cosine similarity to find a relevance score for each document.

So I tried:

    documentRdd = sc.textFile("documents.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
    keyWords = sc.textFile("keywords.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
    normalizer1 = Normalizer()
    hashingTF = HashingTF()
    tf = hashingTF.transform(documentRdd)
    tf.cache()
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
    normalizedtfidf=normalizer1.transform(tfidf)

Now I want to calculate the cosine similarity between normalizedtfidf and keyWords. So I tried:

    x = Vectors.dense(normalizedtfidf)
    y = Vectors.dense(keywordTF)
    print(1 - x.dot(y)/(x.norm(2)*y.norm(2)) , "is the releavance score")

But this throws the error:

TypeError: float() argument must be a string or a number

This means I am passing the wrong format. Any help is appreciated.

Update

I then tried

    x = Vectors.sparse(normalizedtfidf.count(),normalizedtfidf.collect())
    y = Vectors.sparse(keywordTF.count(),keywordTF.collect())

but got

TypeError: Cannot treat type as a vector

as the error.

Upvotes: 0

Views: 864

Answers (1)

cylim

Reputation: 542

You get these errors because you are trying to force an RDD into a Vector. Vectors.dense and Vectors.sparse expect the data for a single vector, not an RDD of vectors.

You can get what you need without that conversion by following these steps:

  1. Join both your RDDs into one RDD. Note that I am assuming you do not have a unique index in both RDDs for joining.

         # Adding index to both RDDs by row.
         rdd1 = normalizedtfidf.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
         rdd2 = keywordTF.zipWithIndex().map(lambda arg : (arg[1], arg[0]))

         # Join both RDDs.
         rdd_joined = rdd1.join(rdd2)

  2. Map the joined RDD with a function to calculate cosine distance.

         def cosine_dist(row):
             x = row[1][0]
             y = row[1][1]
             return (1 - x.dot(y)/(x.norm(2)*y.norm(2)))

         res = rdd_joined.map(cosine_dist)
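
To make the pairing in step 1 concrete, here is a rough plain-Python stand-in for what zipWithIndex followed by join does. It does not use Spark, and the sample rows (docs, keys) are made up for illustration. The point it shows is that join is an inner join on the row index, so a row with no partner at the same index in the other RDD is dropped.

```python
# Plain-Python stand-in for zipWithIndex() + join(); no Spark involved.
# The sample rows below are hypothetical, chosen only to show the pairing.
docs = ["doc-a", "doc-b", "doc-c"]
keys = ["key-a", "key-b"]

# zipWithIndex() yields (value, index); the map in step 1 swaps that
# to (index, value) so the index can serve as the join key.
indexed_docs = {i: v for i, v in enumerate(docs)}
indexed_keys = {i: v for i, v in enumerate(keys)}

# join() is an inner join: only indices present on both sides survive.
joined = {
    i: (indexed_docs[i], indexed_keys[i])
    for i in indexed_docs
    if i in indexed_keys
}

print(joined)
```

Here "doc-c" has no keyword row at index 2, so it does not appear in the joined result. If every document should be scored against the same keyword vector, pairing by row index may not be what you want.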

You can then use your results or run collect to see them.
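
One thing worth checking is the direction of the score. cosine_dist above returns 1 minus the cosine similarity, so 0 means the two vectors point the same way and larger values mean less similar. If you want a relevance score where higher is better, you would probably drop the "1 -". The sketch below shows the difference in plain Python, with sparse vectors stored as {index: weight} dicts. The function name cosine_similarity, the sample vectors, and the choice to return 0.0 for an all-zero vector are my assumptions, not part of the Spark API.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors given as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(w * y.get(i, 0.0) for i, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat an all-zero vector as unrelated instead of
        # dividing by zero.
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical vectors for illustration.
doc = {0: 3.0, 2: 4.0}
query = {0: 3.0, 2: 4.0}   # same direction as doc
other = {1: 1.0}           # shares no index with doc

similarity = cosine_similarity(doc, query)
distance = 1 - similarity  # what cosine_dist in step 2 computes

print(similarity, distance, cosine_similarity(doc, other))
```

The same zero-norm guard may be worth adding to cosine_dist, since a document or keyword row that hashes to an empty vector would otherwise cause a division by zero.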

Upvotes: 1
