Reputation: 483
I have DocsRDD: RDD[(String, String)]
val DocsRDD = sc.wholeTextFiles("myDirectory/*" , 2)
DocsRDD:
Doc1.txt , bla bla bla .....\n bla bla bla \n bla ... bla
Doc2.txt , bla bla bla .....bla \n bla bla \n bla ... bla
Doc3.txt , bla bla bla .....\n bla bla bla \n bla ... bla
Doc4.txt , bla bla \n .....\n bla bla bla bla \n ... bla
Is there an efficient, elegant way to extract n-grams from these with mapPartitions? So far I have tried everything; I have read everything I could find about mapPartitions at least five times, but I still cannot understand how to use it! It seems way too difficult to manipulate. In short, I want:
val NGramsRDD = DocsRDD.map(x => (x._1 , x._2.sliding(n) ) )
but efficiently with mapPartitions. My basic misunderstanding of mapPartitions is :
OneDocRDD : RDD[String]
val OneDocRDD = sc.textFile("myDoc1.txt", 2)
  .mapPartitions(s1 => s2)  // expected signature: s1: Iterator[String] => s2: Iterator[String]
I cannot understand this! Since when is s1 an Iterator[String]? s1 is a String after sc.textFile.
Alright, my second question is: will mapPartitions improve my performance over map in this situation?
Last but not least: can f() be:
f(Iterator[String]) : Iterator[Something else?]
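For example (just a sketch of what I mean), something like this, which would turn OneDocRDD: RDD[String] into an RDD[Int] of line lengths:
val lineLengthsRDD = OneDocRDD.mapPartitions((it: Iterator[String]) => it.map(_.length))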
Upvotes: 3
Views: 17843
Reputation: 1317
I'm not sure that .mapPartitions will help (at least, not given the example), but using .mapPartitions would look like:
val NGramsRDD = sc.wholeTextFiles("myDirectory/*", 2)
  .mapPartitions(iter => {
    // Here you can initialize objects that you need once per partition,
    // instead of once for each x as in .map.
    // wholeTextFiles gives (filename, content) pairs, so x._1 / x._2 work:
    iter.map(x => (x._1, x._2.sliding(n)))
  })
Normally you want to use .mapPartitions to create/initialize an object that you don't want to ship to the worker nodes (for example, because it is too big) or that can't be serialized. Without .mapPartitions you would have to create it inside .map, which would be inefficient because the object would be created anew for each x.
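For instance, here is a minimal sketch (assuming word-level n-grams and that n is in scope) where the Regex is compiled once per partition instead of once per document:
val wordNGramsRDD = sc.wholeTextFiles("myDirectory/*", 2)
  .mapPartitions { iter =>
    val wordPattern = "\\w+".r  // compiled once per partition
    iter.map { case (name, text) =>
      val words = wordPattern.findAllIn(text).toVector
      (name, words.sliding(n).map(_.mkString(" ")).toVector)
    }
  }
This keeps the expensive setup out of the per-element loop, which is exactly what .mapPartitions buys you over .map.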
Upvotes: 9