Reputation: 1291
I have built an RDD from a file, where each element of the RDD is a section of the file, separated by a delimiter.
val inputRDD1: RDD[(String, Long)] = myUtilities.paragraphFile(spark, path1)
  .coalesce(100 * spark.defaultParallelism)
  .zipWithIndex() // RDD[(String, Long)]
  .filter(f => f._2 != 0)
The reason I do the last operation above (the filter) is to remove the first element, the one at index 0.
Is there a better way to remove the first element than checking the index of every element, as done above?
Thanks!
Upvotes: 4
Views: 3699
Reputation: 149598
One possibility is to use RDD.mapPartitionsWithIndex
and to drop the first element from the iterator of the partition with index 0:
val inputRDD = myUtilities
  .paragraphFile(spark, path1)
  .coalesce(100 * spark.defaultParallelism)
  .mapPartitionsWithIndex(
    (index, it) => if (index == 0) it.drop(1) else it,
    preservesPartitioning = true
  )
This way, you only ever advance a single item on the first partition's iterator, while all other iterators remain untouched. Is this more efficient? Probably. Either way, I'd test both versions to see which one performs better.
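To see the per-partition logic in isolation, here is a minimal sketch that needs no Spark cluster. It models partitions as a plain Seq of iterators; the object name, the helper mapPartitionsWithIndex defined here, and the sample data are all hypothetical stand-ins for illustration, not Spark's own API.

```scala
object DropFirstElementSketch {
  // Hypothetical stand-in for RDD.mapPartitionsWithIndex:
  // applies f to each partition's iterator together with the partition index.
  def mapPartitionsWithIndex[A](partitions: Seq[Iterator[A]])(
      f: (Int, Iterator[A]) => Iterator[A]): Seq[Iterator[A]] =
    partitions.zipWithIndex.map { case (it, index) => f(index, it) }

  // Same function body as in the answer: only partition 0 is advanced by one element.
  def dropFirst[A](partitions: Seq[Iterator[A]]): Seq[Iterator[A]] =
    mapPartitionsWithIndex(partitions)((index, it) => if (index == 0) it.drop(1) else it)

  def main(args: Array[String]): Unit = {
    // Made-up sample data: three "partitions", the first one starting with the unwanted element.
    val partitions = Seq(
      Iterator("header", "a", "b"),
      Iterator("c", "d"),
      Iterator("e")
    )
    // Flatten the partitions back into one sequence to inspect the result.
    val remaining = dropFirst(partitions).flatMap(_.toList)
    println(remaining.mkString(", "))
  }
}
```

One caveat worth checking on real data: drop(1) on an empty iterator does nothing, so if partition 0 happens to be empty, the first actual element lives in a later partition and would survive.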
Upvotes: 5