S.Kang

Reputation: 611

How to check whether an RDD is empty, efficiently?

I used Spark's RDD.isEmpty after a filter, but I think it costs a lot of time for large data, because isEmpty is an action built on take(1).

Here is example code:

val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)))
  .reduceByKey(_ + _)
  .filter(_._2 > 5)
// Array[(String, Int)] = Array()

if(!data.isEmpty()){
    //running code...
}

Is there a more efficient way to check whether the data value is empty or not?

Upvotes: 1

Views: 7872

Answers (2)

rakesh

Reputation: 2051

As you have already said, rdd.isEmpty is an action; unless you refactor the code to remove the if condition (using a more functional style instead of an imperative one), the inefficiency cannot be removed. The simplest solution, as already suggested, is to cache the RDD before calling isEmpty, so that the chain of transformations is executed only once, provided you have enough cache memory.

As I am not aware of what you want to do inside the if(!data.isEmpty()) block, the only suggestion I can give is that rdd.map, rdd.foreach, etc. are perfectly valid even on an empty RDD, so you may not need the check at all. If you give more detail about the problem, we could suggest a functional approach.
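The caching idea above could look like this, reusing the sc and pipeline from the question (a sketch only; whether it helps depends on having enough cache memory):

```scala
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)))
  .reduceByKey(_ + _)
  .filter(_._2 > 5)
  .cache() // mark the filtered RDD for reuse across actions

if (!data.isEmpty()) { // first action: runs the plan once and populates the cache
  data.foreach(println) // later actions read the cached partitions instead of recomputing
}
```

Without the cache() call, the reduceByKey and filter would be recomputed by every action on data, including both isEmpty and foreach.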

Upvotes: 3

puhlen

Reputation: 8529

RDD.isEmpty is the most efficient. It does the least amount of work possible.

Remember that an RDD is not data, it's an execution plan. It's not possible to check whether an RDD contains data without actually evaluating it, so you must execute an action and check the result.

If you need to check whether an RDD is empty before the final action, you can persist it first to cache the intermediate state, so it won't need to be re-evaluated by later jobs.
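A sketch of that pattern, assuming the data RDD from the question (persist with an explicit StorageLevel is a more general form of cache()):

```scala
import org.apache.spark.storage.StorageLevel

// Persist before the emptiness check so the evaluated partitions
// are kept (spilling to disk if they don't fit in memory).
val filtered = data.persist(StorageLevel.MEMORY_AND_DISK)

if (!filtered.isEmpty()) { // triggers one job, materializing the cache
  filtered.saveAsTextFile("/tmp/output") // hypothetical final action; reuses the persisted data
}

filtered.unpersist() // release the cached partitions when done
```

The isEmpty call still runs a job, but with persist the work is done once rather than once per action.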

Upvotes: 5
