Reputation: 439
In Spark, when computing an RDD: if, for example, I have an RDD[Either[A, B]] and I want to obtain the RDD[A] and the RDD[B] from it, I've basically found two approaches:
filter + map
val rddA = rddEither.filter(_.isLeft).map { case Left(a) => a }
val rddB = rddEither.filter(_.isRight).map { case Right(b) => b }
flatMap
val rddA = rddEither.flatMap { case Left(a) => Some(a); case _ => None }  // default case needed, else Right values throw a MatchError
val rddB = rddEither.flatMap { case Right(b) => Some(b); case _ => None } // likewise for Left values
Is the flatMap version more efficient, since it would potentially require less computation?
And another question: is it good to persist the RDD (I mean rddEither) to speed up execution, since I will compute two operations starting from that source, or will Spark take care of that?
Upvotes: 1
Views: 1602
Reputation: 22850
Probably collect would be a little bit clearer (and maybe take less computation, but I don't think that would have a big impact on performance).
val rddA = rddEither.collect { case Left(a) => a }
val rddB = rddEither.collect { case Right(b) => b }
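For a quick check, here is a self-contained sketch (the local master, app name, and toy data are just assumptions for illustration):
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("split-either"))
val rddEither = sc.parallelize(Seq[Either[String, Int]](Left("a"), Right(1), Left("b"), Right(2)))
val rddA = rddEither.collect { case Left(a) => a }  // RDD[String]
val rddB = rddEither.collect { case Right(b) => b } // RDD[Int]
println(rddA.collect().toList) // List(a, b)
println(rddB.collect().toList) // List(1, 2)
As for efficiency: if I recall correctly, collect with a partial function is implemented in RDD.scala as filter(f.isDefinedAt).map(f), so it should perform roughly the same as the filter + map version.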
"And other question: is good to persist the rdd to speed up execution i mean the rddEither because i will compute 2 operation starting from that source or spark will take care of that?"
Spark won't take care of that. Spark is lazy, which means that for each action it will recompute everything it needs to produce the result, unless there is a cache somewhere.
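For illustration, a minimal sketch of caching the shared source, reusing the names from the question (the count() calls just stand in for whatever actions you actually run):
rddEither.cache() // mark for caching; nothing is computed yet, since Spark is lazy
val rddA = rddEither.collect { case Left(a) => a }
val rddB = rddEither.collect { case Right(b) => b }
rddA.count() // first action computes rddEither and stores its partitions in memory
rddB.count() // second action reuses the cached partitions instead of recomputing rddEither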
However, adding a cache won't necessarily improve performance. To make the cache fast you would need to keep it in memory only, which can hurt the performance of other operations, since less memory is left for them. And if you save it to disk to reduce memory usage, the time taken to read and deserialize the data from disk could be the same as, or greater than, the time to recompute the original RDD. So you may need to benchmark multiple options to decide which one is better.
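As a sketch of that trade-off, these are the standard storage levels you would benchmark against each other (note that an RDD can only be assigned one storage level; changing it afterwards throws an exception, so pick one per run):
import org.apache.spark.storage.StorageLevel
// Pick ONE of the following per RDD:
rddEither.persist(StorageLevel.MEMORY_ONLY)     // deserialized, in memory: fastest to read back, highest memory pressure
rddEither.persist(StorageLevel.MEMORY_ONLY_SER) // serialized, in memory: smaller footprint, extra CPU to deserialize
rddEither.persist(StorageLevel.MEMORY_AND_DISK) // spills to disk when memory is tight: avoids recomputation, pays disk I/O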
Note: this is a great post about memory management in Spark; it may be worth reading for tuning the cache.
Upvotes: 2