Dalganjan Sengar

Reputation: 33

Performance issue when checking whether a Spark DataFrame is empty

Which is recommended with respect to performance, and why: spark.dataframe.count() or spark.dataframe.take(1)?

Upvotes: 0

Views: 87

Answers (1)

sgungormus

Reputation: 119

take(1) is more efficient than count(). If you check the source code of RDD.take, its doc comment explains:

"Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit."
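
As a rough illustration of the difference (a minimal sketch, not code from the question; the local session and the tiny example DataFrame are made up for demonstration):

import org.apache.spark.sql.SparkSession

object EmptinessCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emptiness-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("value")

    // count() runs a job over every partition to compute the total,
    // then compares it against zero.
    val emptyViaCount = df.count() == 0

    // take(1) scans one partition first and only fans out to more
    // partitions if no row was found, so it can stop almost immediately.
    val emptyViaTake = df.take(1).isEmpty

    println(s"via count: $emptyViaCount, via take: $emptyViaTake")
    spark.stop()
  }
}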

For your use case, isEmpty() should be the best option. Its source code uses take(1) yet again:

// From org.apache.spark.rdd.RDD: isEmpty() delegates to take(1).
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
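
Applied to the df from the sketch above, the check is then a one-liner (df.rdd.isEmpty() is standard Spark API; Spark 2.4 and later also expose the same check directly on Dataset as df.isEmpty):

// rdd.isEmpty() delegates to take(1), as the source above shows.
if (df.rdd.isEmpty()) println("DataFrame is empty")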

Upvotes: 0
