Cosine similarity over tf idf output in spark dataframe (scala)

Question

I am using Spark Scala to calculate cosine similarity between the Dataframe rows.

Dataframe format is below:

root
 |-- id: long (nullable = true)
 |-- features: vector (nullable = true)

Sample of the dataframe is below:

+---+--------------------+
| id|            features|
+---+--------------------+
| 65|(10000,[48,70,87,...|
|191|(10000,[1,73,77,1...|
+---+--------------------+

Code that gives me the result is below:

val df = spark.read.json("articles_line.json")
val tokenizer = new Tokenizer().setInputCol("desc").setOutputCol("words")
val wordsDF = tokenizer.transform(df)

def flattenWords = udf( (s: Seq[Seq[String]]) => s.flatMap(identity) )
val groupedDF = wordsDF.groupBy("id").
  agg(flattenWords(collect_list("words")).as("grouped_words"))
val hashingTF = new HashingTF().
  setInputCol("grouped_words").setOutputCol("rawFeatures").setNumFeatures(10000)
val featurizedData = hashingTF.transform(groupedDF)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
val asDense = udf((v: Vector) => v.toDense) //transform to dense matrix
val newDf = rescaledData.select('id, 'features)
    .withColumn("dense_features", asDense($"features")

Final dataframe looks like

+-----+--------------------+--------------------+
|   id|            features|      dense_features|
+-----+--------------------+--------------------+
|21209|(10000,[128,288,2...|[0.0,0.0,0.0,0.0,...|
|21223|(10000,[8,18,32,4...|[0.0,0.0,0.0,0.0,...|
+-----+--------------------+--------------------+

I don't understand how to process "dense_features" to calculate cosine similarity. This article didn't work for me. Appreciate any help.

Example of one row of dense_features. For simplicity the length is cuted.

[[0.0,0.0,0.0,0.0,7.08,0.0,0.0,0.0,0.0,2.24,0.0,0.0,0.0,0.0,0.0,,9.59]]

Cosine similarity over tf idf output in spark dataframe (scala)

Answers (1)

Related Questions