Kireet Bhat

Reputation: 85

Iterating over CompactBuffer in an RDD

I have an RDD[(String, Iterable[WikipediaArticle])] which looks something like this:

(Groovy, CompactBuffer(WikipediaArticle( {has a String title} , {has some text corresponding to that title} ), WikipediaArticle( {has a String title} , {has some text corresponding to that title} )))

The curly brackets above are only there to separate the title from the text and keep things cleaner.

Groovy: the String key
WikipediaArticle: a class with two attributes, title and text

I need an output of type List[(String, Int)], where:
String: the first element of each RDD pair, which is unique on each line
(in the above case, that is "Groovy")
Int: the count of WikipediaArticles inside the CompactBuffer for that String

I have tried to make things as clear as possible, however, if you think there are chances to improve the question or you have any doubts please feel free to ask.

Upvotes: 2

Views: 1810

Answers (1)

Leo C

Reputation: 22449

If you treat each element of the RDD as a (k, v) pair, with the keyword as k and the CompactBuffer as v, one approach is to use map with a partial function, as in the following:

case class WikipediaArticle(title: String, text: String)

val rdd = sc.parallelize(Seq(
  ( "Groovy", Iterable( WikipediaArticle("title1", "text1"), WikipediaArticle("title2", "text2") ) ),
  ( "nifty", Iterable( WikipediaArticle("title2", "text2"), WikipediaArticle("title3", "text3") ) ),
  ( "Funny", Iterable( WikipediaArticle("title1", "text1"), WikipediaArticle("title3", "text3"), WikipediaArticle("title4", "text4") ) )
))

rdd.map { case (k, v) => (k, v.size) }
// res1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:29

res1.collect.toList
// res2: List[(String, Int)] = List((Groovy,2), (nifty,2), (Funny,3))
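Note that the same `map { case (k, v) => (k, v.size) }` pattern works on plain Scala collections, since the counting itself has nothing Spark-specific about it. Here is a minimal, Spark-free sketch (with made-up sample data mirroring the RDD's element type) that you can paste into any Scala REPL to see the shape of the result:

```scala
case class WikipediaArticle(title: String, text: String)

// Hypothetical sample data, shaped like the RDD's (String, Iterable[WikipediaArticle]) elements
val grouped: List[(String, Iterable[WikipediaArticle])] = List(
  ("Groovy", Iterable(WikipediaArticle("title1", "text1"), WikipediaArticle("title2", "text2"))),
  ("Funny",  Iterable(WikipediaArticle("title1", "text1"), WikipediaArticle("title3", "text3"), WikipediaArticle("title4", "text4")))
)

// Same transformation as the Spark version: keep the key, replace the buffer with its size
val counts: List[(String, Int)] = grouped.map { case (k, v) => (k, v.size) }
// counts: List((Groovy,2), (Funny,3))
```

On an actual RDD the only difference is that `map` is lazy and distributed, so you need the `collect.toList` step shown above to materialize the `List[(String, Int)]` on the driver.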

Upvotes: 1
