Reputation: 115
I want to compare two texts in Scala and calculate the similarity rate. I begin to code this but I'm blocked :
import org.apache.spark._
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]):Unit = {
val white = "/whiteCat.txt" // "The white cat is eating a white soup"
val black = "/blackCat.txt" // "The black cat is eating a white sandwich"
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val b = sc.textFile(white)
val words = b.flatMap(line => line.split("\\W+"))
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
counts.take(10).foreach(println)
//counts.saveAsTextFile(outputFile)
}
}
I succeeded to split the words of each text and count the occurency of each word. For example in the file1 there is :
(The,1)
(white,2)
(cat,1)
(is,1)
(eating,1)
(a,1)
(soup,1)
To calculate the similarity rate. I have to do this algorithm but I'm not experienced with Scala
i=0
foreach word in the first text
j = 0
IF keyFile1[i] == keyFile2[j]
THEN MIN(valueFile1[i], valueFile2[j]) / MAX(valueFile1[i], valueFile2[j])
ELSE j++
i++
Can you help me please ?
Upvotes: 0
Views: 1419
Reputation: 3021
There are numerous algorithms to find similarity between strings. One of these methods is edit distance. There are different definitions of edit distance and there are different sets of operations based on the methodology. But main idea is finding minimum series of operations (insertion, deletion, substitution) to convert string a into string b.
Levenshtein distance and Longest Common Subsequence are widely known algorithms to find similarity between strings. But these methods are insensitive to contexts. Because of this situation, you may want to take a look at this article in which n-gram similarity and distance are represented. You can also find Scala implementations of these algorithms in github or rosetta code.
I hope it helps!
Upvotes: 0
Reputation: 2681
That seems straight forward using for comprehension
for( a <- counts1; b <- counts2 if a._1==b._1 ) yield Math.min(a._2,b._2)/Math.max(a._2,b._2)
Edit: sorry, the above code does not work. Here's a modified code with for comprehension. counts1 and counts2 are 2 counts from the question.
val result= for( (t1,t2) <- counts1.cartesian(counts2) if( t1._1==t2._1)) yield Math.min(t1._2,t2._2).toDouble/Math.max(t1._2,t2._2).toDouble
the result: result.foreach(println) 1.0 0.5 1.0 1.0 1.0
Upvotes: 0
Reputation: 22439
You can use leftOuterJoin
to join the two key/value-pair RDDs to generate a RDD of type Array[(String, (Int, Option[Int]))]
, gather both counts from the Tuples, flatten the counts to type Int
, and apply your min/max formula, as in the following example:
val wordCountsWhite = sc.textFile("/path/to/whitecat.txt").
flatMap(_.split("\\W+")).
map((_, 1)).
reduceByKey(_ + _)
wordCountsWhite.collect
// res1: Array[(String, Int)] = Array(
// (is,1), (eating,1), (cat,1), (white,2), (The,1), (soup,1), (a,1)
// )
val wordCountsBlack = sc.textFile("/path/to/blackcat.txt").
flatMap(_.split("\\W+")).
map((_, 1)).
reduceByKey(_ + _)
wordCountsBlack.collect
// res2: Array[(String, Int)] = Array(
// (is,1), (eating,1), (cat,1), (white,1), (The,1), (a,1), (sandwich,1), (black,1)
// )
val similarityRDD = wordCountsWhite.leftOuterJoin(wordCountsBlack).map{
case (k: String, (c1: Int, c2: Option[Int])) => {
val counts = Seq(Some(c1), c2).flatten
(k, counts.min.toDouble / counts.max )
}
}
similarityRDD.collect
// res4: Array[(String, Double)] = Array(
// (is,1.0), (eating,1.0), (cat,1.0), (white,0.5), (The,1.0), (soup,1.0), (a,1.0)
// )
Upvotes: 2