TroubleShooter

Reputation: 301

Spark RDD isEmpty method throws NullPointerException when the RDD is not null

This took me by surprise (while explaining to someone, unfortunately).

I am curious about what happens internally in Spark in the following snippet.

val rdd = sc.parallelize(null)
rdd == null  // false
rdd.isEmpty  // NullPointerException

Before you ask: I agree that parallelizing null is questionable, but this is simply a condition we ran into in our streaming application.

I read somewhere that isEmpty internally calls rdd.take(1), which is what ultimately throws the exception, but this seems inconsistent with the language's usual behavior. I also found that in some cases it takes longer (a few seconds, sometimes) to return with the NPE, although that could be because it goes over the network looking for data.
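For what it's worth, isEmpty in Spark's RDD.scala boils down to something like this (paraphrased from the source, so take the exact form with a grain of salt):

def isEmpty(): Boolean = withScope {
  // Both checks force evaluation: computing the partitions, and
  // take(1), which launches a job. Either one ends up touching the
  // underlying Seq, and that is where the null gets dereferenced.
  partitions.length == 0 || take(1).length == 0
}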

So the question is: why does this happen? Is this expected behavior? And is there a better way to deal with it than catching the NPE?

Many thanks in advance!

Upvotes: 1

Views: 844

Answers (1)

user8628494

Reputation: 46

The parallelize method expects a Seq[T]. While null is a valid substitution for a Seq[T], a NullPointerException is to be expected whenever it is actually accessed as a Seq, and null is not equivalent to an empty Seq.
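You can see the same behavior with a plain Scala Seq, no Spark involved (a minimal REPL-style sketch):

val s: Seq[Int] = null
s == null  // true: == is null-safe in Scala
s.isEmpty  // NullPointerException: calling a method on a null reference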

Either use SparkContext.emptyRDD:

sc.emptyRDD[T]

or an empty Seq:

sc.parallelize(Seq.empty[T])
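Concretely, if the incoming batch can be null (as in the streaming case above), a small guard keeps everything downstream NPE-free. Here data and sc are placeholders for your batch (assumed to be a Seq[String]) and your SparkContext:

val rdd =
  if (data == null) sc.emptyRDD[String]  // genuinely empty RDD
  else sc.parallelize(data)              // normal, non-null batch

rdd.isEmpty  // true when data was null, and no NullPointerException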

Upvotes: 3
