Reputation: 611
I call Spark's RDD.isEmpty after a filter. But I think it costs a lot of time on large data, because isEmpty is an action implemented with take(1).
Here is example code:
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)))
  .reduceByKey(_ + _)
  .filter(_._2 > 5)
// Array[(String, Int)] = Array()

if (!data.isEmpty()) {
  // running code...
}
Is there a more efficient way to check whether the RDD is empty or not?
Upvotes: 1
Views: 7872
Reputation: 2051
As you have already said, rdd.isEmpty is an action. Unless you refactor the code to remove the if condition (using a more functional style instead of an imperative one), the inefficiency cannot be removed entirely. The simplest improvement, as already suggested, is to cache the RDD before calling isEmpty, so that the transformations are executed only once, provided you have enough cache memory.
Since I don't know what you want to do inside the "if(!data.isEmpty())" block, the only other suggestion I can give is that rdd.map, rdd.foreach, etc. are perfectly valid even on an empty RDD, so the check may not be needed at all. If you give more detail about the problem, we could suggest a functional approach.
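For example, caching before the emptiness check means the reduceByKey/filter chain is only computed once; a minimal sketch, assuming an existing SparkContext named sc:

```scala
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)))
  .reduceByKey(_ + _)
  .filter(_._2 > 5)
  .cache() // materialized by the first action, reused afterwards

if (!data.isEmpty()) {
  // This later action reads the cached partitions instead of
  // recomputing reduceByKey and filter from scratch.
  data.collect().foreach(println)
}
```

Without the cache() call, both isEmpty() and collect() would trigger a full recomputation of the lineage.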
Upvotes: 3
Reputation: 8529
RDD.isEmpty
is the most efficient. It does the least amount of work possible.
Remember that an RDD is not data, it's an execution plan. It's not possible to check whether there's data in an RDD without actually evaluating it, so you must execute an action and check the result.
If you need to check whether an RDD is empty before the final action, you can persist it first to cache the intermediate state, so it won't need to be re-evaluated in later jobs.
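A minimal sketch of that pattern, assuming an RDD named data built from earlier transformations (the output path here is just a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

val filtered = data.filter(_._2 > 5)
  .persist(StorageLevel.MEMORY_AND_DISK) // keep the intermediate state

if (!filtered.isEmpty()) {          // first job: evaluates and caches `filtered`
  filtered.saveAsTextFile("/tmp/out") // second job: reuses the cached partitions
}

filtered.unpersist() // release the cached blocks once finished
```

MEMORY_AND_DISK spills partitions to disk if they don't fit in memory, which is safer than the default MEMORY_ONLY when the filtered result might be large.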
Upvotes: 5