Reputation: 103
I'm working with Apache Spark and Scala. I have an RDD of String,Int
val counts =words.map(word => (word, 1)).reduceByKey((a,b) => (a + b))
Now I reduced the RDD by Key, but I'd like to add another feature to reduce also the words that are similar.
I though to use Levenshtein distance, Euclidean distance or cosine distance.
So, how can I apply one of this functions to reduce my RDD?
Example:
RDD -> (forks,12), (fork,4), (chair,15) , (table,1), (tables,11)
Admitting that the similarity algorithm works, how can I obtain a reduced RDD like:
RDD -> (fork,16), (table,12), (chair,15)
I tried something like:
counts.foldLeft(){(x,y) =>
if(x._1.euclideanDistance(y._1) > 0.9)
(x,x._2+y._2)
}
Upvotes: 2
Views: 1929
Reputation: 25919
@Daniel's reply is probably the proper way to go to solve the overall problem.
Regarding the specific q. when you do an if
in the fold you need also to supply the else
in your case that would be preserving x with it's current count
Upvotes: 0
Reputation: 27455
What you are trying will not work.
If you only have a distance(a, b)
function, it is really inefficient and complicated to solve the problem. You would need to use RDD.cartesian
to generate all the possible (word1, word2)
pairs. Then filter out those with too great a distance. Now you have the similar word pairs. Let's say they are (fox, fix)
, (fix, six)
, and their reversals. You then want to sum up the counts for fox
, fix
, and six
. For this you need to find the connected components in the graph defined by the similar word pairs. Once you have the component ID for each word, you sum the counts by the component IDs.
I think the solution rather is to write a function that can turn a word into its "canonical" form. It would turn forks
, forking
, and forked
into fork
. Then you can just apply this and reduceByKey
again.
It will be fastest to do this step without Spark. Once you have calculated counts
with Spark, you have a tiny data set — one integer for each distinct word. It's easiest to collect
it and then map
and groupBy
counts
locally.
Upvotes: 6