Alessio Conese

Reputation: 103

Using Apache-Spark, reduce or fold an RDD depending on a condition

I'm working with Apache Spark and Scala. I have an RDD of (String, Int) pairs:

val counts = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)

This reduces the RDD by key, but I'd also like to merge words that are similar.

I thought of using Levenshtein distance, Euclidean distance, or cosine distance.

So, how can I apply one of these functions to reduce my RDD?

Example:

RDD ->  (forks,12), (fork,4), (chair,15) , (table,1), (tables,11)

Assuming the similarity algorithm works, how can I obtain a reduced RDD like:

RDD -> (fork,16), (table,12), (chair,15)

I tried something like:

counts.foldLeft(){(x,y) => 
  if(x._1.euclideanDistance(y._1) > 0.9) 
    (x,x._2+y._2) 
}

Upvotes: 2

Views: 1929

Answers (2)

Arnon Rotem-Gal-Oz

Reputation: 25919

@Daniel's reply is probably the right way to solve the overall problem.

Regarding the specific question: when you use an if in the fold, you also need to supply the else; in your case that would mean preserving x with its current count.
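
For illustration only, the combining function could look roughly like this; similarity here is a hypothetical placeholder, and the else branch is the part that was missing (this addresses the shape of the code, not the overall approach):

// Hypothetical placeholder for a real distance/similarity measure.
def similarity(a: String, b: String): Double = if (a == b) 1.0 else 0.0

val combine = (x: (String, Int), y: (String, Int)) =>
  if (similarity(x._1, y._1) > 0.9) (x._1, x._2 + y._2)
  else x  // the missing else: keep x with its current count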

Upvotes: 0

Daniel Darabos

Reputation: 27455

What you are trying will not work.

If you only have a distance(a, b) function, it is really inefficient and complicated to solve the problem. You would need to use RDD.cartesian to generate all the possible (word1, word2) pairs. Then filter out those with too great a distance. Now you have the similar word pairs. Let's say they are (fox, fix), (fix, six), and their reversals. You then want to sum up the counts for fox, fix, and six. For this you need to find the connected components in the graph defined by the similar word pairs. Once you have the component ID for each word, you sum the counts by the component IDs.
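
To make those steps concrete, a rough sketch of this route could look as follows, assuming a hypothetical isSimilar check and using GraphX (a separate spark-graphx dependency) for the connected components; it illustrates the shape of the computation rather than an efficient solution:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Crude stand-in for a real distance check (Levenshtein, etc.).
def isSimilar(a: String, b: String): Boolean =
  a.startsWith(b) || b.startsWith(a)

// Give every distinct word a numeric id.
val wordIds: RDD[(String, Long)] = counts.keys.zipWithUniqueId()
val vertices: RDD[(Long, String)] = wordIds.map(_.swap)

// All word pairs (each unordered pair once), keeping only the similar ones.
val similarPairs: RDD[(String, String)] =
  counts.keys.cartesian(counts.keys)
    .filter { case (a, b) => a < b && isSimilar(a, b) }

// Turn the similar pairs into edges between word ids.
val edges: RDD[Edge[Int]] = similarPairs
  .join(wordIds)                                     // (a, (b, idA))
  .map { case (_, (b, idA)) => (b, idA) }
  .join(wordIds)                                     // (b, (idA, idB))
  .map { case (_, (idA, idB)) => Edge(idA, idB, 1) }

// Connected components: every word gets the id of its component.
val componentOfWord: RDD[(String, Long)] =
  Graph(vertices, edges).connectedComponents()
    .vertices                                        // (wordId, componentId)
    .join(vertices)                                  // (wordId, (componentId, word))
    .map { case (_, (componentId, word)) => (word, componentId) }

// Sum the counts per component, keeping the shortest word as the label.
val merged: RDD[(String, Int)] = componentOfWord
  .join(counts)                                      // (word, (componentId, count))
  .map { case (word, (componentId, count)) => (componentId, (word, count)) }
  .reduceByKey { case ((w1, c1), (w2, c2)) =>
    (if (w1.length <= w2.length) w1 else w2, c1 + c2)
  }
  .values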

I think the better solution is to write a function that turns a word into its "canonical" form. It would turn forks, forking, and forked into fork. Then you can just apply this and reduceByKey again.
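
A minimal sketch of that idea, with a deliberately crude canonicalize placeholder (a real version would use a proper stemmer):

// Crude placeholder; a real implementation would use a stemming library.
def canonicalize(word: String): String =
  word.stripSuffix("ing").stripSuffix("ed").stripSuffix("s")

val merged = counts
  .map { case (word, count) => (canonicalize(word), count) }
  .reduceByKey(_ + _)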

It will be fastest to do this step without Spark. Once you have calculated counts with Spark, you have a tiny data set — one integer for each distinct word. It's easiest to collect it and then map and groupBy counts locally.
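
For example, reusing the hypothetical canonicalize from the sketch above, the local version could be as simple as:

// counts is small after reduceByKey, so it is cheap to collect to the driver.
val local: Array[(String, Int)] = counts.collect()

val mergedLocally: Map[String, Int] = local
  .groupBy { case (word, _) => canonicalize(word) }
  .map { case (canon, group) => (canon, group.map(_._2).sum) }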

Upvotes: 6
