Dalganjan Sengar

Reputation: 33

Performance issue when checking whether a Spark DataFrame is empty

Which is recommended with respect to performance, and why: spark.dataframe.count() or spark.dataframe.take(1)?

Upvotes: 0

Views: 87

Answers (1)

sgungormus

Reputation: 119

take(1) is more efficient than count(). If you check the source code of RDD.take, its doc comment explains:

"Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit."
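
As a rough illustration of the difference (a minimal sketch, not code from the question; the local session and the tiny example DataFrame are made up for demonstration):

import org.apache.spark.sql.SparkSession

object EmptinessCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emptiness-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("value")

    // count() runs a job over every partition to compute the total,
    // then compares it against zero.
    val emptyViaCount = df.count() == 0

    // take(1) scans one partition first and only fans out to more
    // partitions if no row was found, so it can stop almost immediately.
    val emptyViaTake = df.take(1).isEmpty

    println(s"via count: $emptyViaCount, via take: $emptyViaTake")
    spark.stop()
  }
}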

For your use case, isEmpty() should be the best option. Its source code uses take(1) yet again:

// From org.apache.spark.rdd.RDD: isEmpty() delegates to take(1).
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
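
Applied to the df from the sketch above, the check is then a one-liner (df.rdd.isEmpty() is standard Spark API; Spark 2.4 and later also expose the same check directly on Dataset as df.isEmpty):

// rdd.isEmpty() delegates to take(1), as the source above shows.
if (df.rdd.isEmpty()) println("DataFrame is empty")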

Upvotes: 0
