Reputation: 33
Which is recommended with respect to performance, and why: spark.dataframe.count() or spark.dataframe.take(1)?
Upvotes: 0
Views: 87
Reputation: 119
take(1) is more efficient than count(). If you check the source code of RDD.take, its documentation says:

"Take the first num elements of the RDD. It works by first scanning one partition, and uses the results from that partition to estimate the number of additional partitions needed to satisfy the limit."
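In other words, count() runs a job over every partition to compute the total, while take(1) scans a single partition first and only fetches more if that one yields no rows. A minimal sketch of the difference (the DataFrame and the local SparkSession here are just illustrative):

import org.apache.spark.sql.SparkSession

// Illustrative setup; in a real application you would reuse your existing session.
val spark = SparkSession.builder()
  .appName("count-vs-take")
  .master("local[*]")
  .getOrCreate()

val df = spark.range(0, 100000000L).toDF("id")

// Full scan: every partition must be processed to produce the total.
val total = df.count()

// Short-circuit: scans one partition first, fetching more only if needed.
val first = df.take(1)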
For your use case, isEmpty() should be the best option. Its source code uses take(1) yet again:
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
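So, if the goal is simply to check whether the DataFrame has any rows, a sketch like the following avoids the full count entirely. It calls isEmpty() on the underlying RDD, which delegates to take(1) as shown above; on Spark 2.4+ you can also call df.isEmpty on the Dataset directly.

// Emptiness check without counting every row.
if (df.rdd.isEmpty()) {
  println("DataFrame is empty")
} else {
  println("DataFrame has at least one row")
}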
Upvotes: 0