Reputation: 115

Compare the similarity between two text in Scala

I want to compare two texts in Scala and calculate the similarity rate. I begin to code this but I'm blocked :

import org.apache.spark._
import org.apache.spark.SparkContext._


object WordCount {

    def main(args: Array[String]):Unit = {
       val white = "/whiteCat.txt" // "The white cat is eating a white soup"
       val black  = "/blackCat.txt" // "The black cat is eating a white sandwich"
       val conf = new SparkConf().setAppName("wordCount")
       val sc = new SparkContext(conf)
       val b =  sc.textFile(white)
       val words = b.flatMap(line => line.split("\\W+"))
       val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
       counts.take(10).foreach(println)
       //counts.saveAsTextFile(outputFile)
       }    
    }

I succeeded to split the words of each text and count the occurency of each word. For example in the file1 there is :

(The,1)
(white,2)
(cat,1)
(is,1)
(eating,1)
(a,1)
(soup,1)

To calculate the similarity rate. I have to do this algorithm but I'm not experienced with Scala

i=0
foreach word in the first text
   j = 0
   IF keyFile1[i] == keyFile2[j]
       THEN MIN(valueFile1[i], valueFile2[j]) / MAX(valueFile1[i], valueFile2[j])
   ELSE j++
   i++

Can you help me please ?

Upvotes: 0

Answers (3)

ulubeyn

Reputation: 3021

There are numerous algorithms to find similarity between strings. One of these methods is edit distance. There are different definitions of edit distance and there are different sets of operations based on the methodology. But main idea is finding minimum series of operations (insertion, deletion, substitution) to convert string a into string b.

Levenshtein distance and Longest Common Subsequence are widely known algorithms to find similarity between strings. But these methods are insensitive to contexts. Because of this situation, you may want to take a look at this article in which n-gram similarity and distance are represented. You can also find Scala implementations of these algorithms in github or rosetta code.

I hope it helps!

Upvotes: 0

Haijin

Reputation: 2681

That seems straight forward using for comprehension

 for( a <- counts1; b <- counts2 if a._1==b._1 ) yield Math.min(a._2,b._2)/Math.max(a._2,b._2)

Edit: sorry, the above code does not work. Here's a modified code with for comprehension. counts1 and counts2 are 2 counts from the question.

val result= for( (t1,t2) <- counts1.cartesian(counts2) if( t1._1==t2._1)) yield Math.min(t1._2,t2._2).toDouble/Math.max(t1._2,t2._2).toDouble

the result: result.foreach(println) 1.0 0.5 1.0 1.0 1.0

Upvotes: 0

Leo C

Reputation: 22439

You can use leftOuterJoin to join the two key/value-pair RDDs to generate a RDD of type Array[(String, (Int, Option[Int]))], gather both counts from the Tuples, flatten the counts to type Int, and apply your min/max formula, as in the following example:

val wordCountsWhite = sc.textFile("/path/to/whitecat.txt").
  flatMap(_.split("\\W+")).
  map((_, 1)).
  reduceByKey(_ + _)

wordCountsWhite.collect
// res1: Array[(String, Int)] = Array(
//   (is,1), (eating,1), (cat,1), (white,2), (The,1), (soup,1), (a,1)
// )

val wordCountsBlack = sc.textFile("/path/to/blackcat.txt").
  flatMap(_.split("\\W+")).
  map((_, 1)).
  reduceByKey(_ + _)

wordCountsBlack.collect
// res2: Array[(String, Int)] = Array(
//   (is,1), (eating,1), (cat,1), (white,1), (The,1), (a,1), (sandwich,1), (black,1)
// )

val similarityRDD = wordCountsWhite.leftOuterJoin(wordCountsBlack).map{
  case (k: String, (c1: Int, c2: Option[Int])) => {
    val counts = Seq(Some(c1), c2).flatten
    (k, counts.min.toDouble / counts.max )
  }
}

similarityRDD.collect
// res4: Array[(String, Double)] = Array(
//   (is,1.0), (eating,1.0), (cat,1.0), (white,0.5), (The,1.0), (soup,1.0), (a,1.0)
// )

Upvotes: 2

Compare the similarity between two text in Scala

Answers (3)

Related Questions