Reputation: 2750
I have got the normalized TF-IDF for the documents and also the keyword RDD, and now I want to compute the cosine similarity to find a relevance score for the document.
So I tried the following:
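import re
from pyspark.mllib.feature import HashingTF, IDF, Normalizer
from pyspark.mllib.linalg import Vectors

documentRdd = sc.textFile("documents.txt").flatMap(lambda l: re.split(r'[^\w]+',l))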
keyWords = sc.textFile("keywords.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
normalizer1 = Normalizer()
hashingTF = HashingTF()
tf = hashingTF.transform(documentRdd)
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
normalizedtfidf=normalizer1.transform(tfidf)
Now I want to calculate the cosine similarity between normalizedtfidf and keyWords. So I tried using
x = Vectors.dense(normalizedtfidf)
y = Vectors.dense(keywordTF)
print(1 - x.dot(y)/(x.norm(2)*y.norm(2)) , "is the relevance score")
But this throws the error
TypeError: float() argument must be a string or a number
Which means I am passing the wrong format. Any help is appreciated.
Update
Then I tried
x = Vectors.sparse(normalizedtfidf.count(),normalizedtfidf.collect())
y = Vectors.sparse(keywordTF.count(),keywordTF.collect())
but got
TypeError: Cannot treat type as a vector
as the error.
Upvotes: 0
Views: 864
Reputation: 542
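You get these errors because you are passing a whole RDD to Vectors.dense and Vectors.sparse. Those functions build a single local vector from local values (numbers, lists, arrays), whereas an RDD is a distributed collection of vectors, so the conversion fails.
You can achieve what you need without the conversion by following these steps: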
# Adding index to both RDDs by row.
rdd1 = normalizedtfidf.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
rdd2 = keywordTF.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
# Join both RDDs.
rdd_joined = rdd1.join(rdd2)
# map the joined RDD with a function to calculate cosine distance.
def cosine_dist(row):
    x = row[1][0]
    y = row[1][1]
    return (1 - x.dot(y)/(x.norm(2)*y.norm(2)))
res = rdd_joined.map(cosine_dist)
You can then use the results in further transformations or call res.collect() to see them.
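For reference, here is a minimal local sketch of the same index, join, and map pattern in plain Python, with no Spark involved. The vectors in doc_vectors and keyword_vectors are made-up stand-ins for the rows of normalizedtfidf and keywordTF, and the helper cosine_similarity is a hypothetical name, not part of any library. It also shows that 1 - cos is the cosine distance, so if you want a relevance score where higher means more similar, you would probably use the similarity itself.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical stand-ins for the rows of normalizedtfidf and keywordTF.
doc_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
keyword_vectors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Local analogue of zipWithIndex(): key each row by its position.
indexed_docs = {i: v for i, v in enumerate(doc_vectors)}
indexed_keywords = {i: v for i, v in enumerate(keyword_vectors)}

# Local analogue of join(): keep only indices present on both sides.
joined = {i: (indexed_docs[i], indexed_keywords[i])
          for i in indexed_docs.keys() & indexed_keywords.keys()}

# Local analogue of map(): one score per joined row.
similarities = {i: cosine_similarity(x, y) for i, (x, y) in joined.items()}
distances = {i: 1.0 - s for i, s in similarities.items()}

for i in sorted(joined):
    print(i, "similarity:", similarities[i], "distance:", distances[i])
```

Note that this pairs rows purely by position, exactly as the join on zipWithIndex does, so it only makes sense if row i of one RDD is meant to be compared with row i of the other.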
Upvotes: 1