Reputation: 2108
In Apache Spark, I know that performing an action that collects the result on the driver, for example calling `collect()` on the data, is an unsafe operation that can lead to an Out Of Memory error if the collected result is larger than what the driver can hold in memory.

Is the `show()` function, which is applied to DataFrames, a function that can lead to an OOM for the same reason, or can I safely use `show()` (for example, for debugging)?
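For example, this is the kind of unsafe pattern I mean, written as a spark-shell sketch with `df` standing in for some large DataFrame:

```scala
// The pattern I am worried about: collect() pulls the entire result
// into the driver JVM, so a large df can OOM the driver.
val everything = df.collect()   // Array[Row] holding the whole dataset

// versus the call I would like to use for debugging:
df.show()                       // prints only the first 20 rows
```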
Upvotes: 0
Views: 1320
Reputation: 4631
`show` is as safe as the execution plan of the evaluated `Dataset`. If the `Dataset` contains wide transformations (non-broadcasted joins, aggregations, repartitions, applications of window functions) or resource-hungry narrow transformations (expensive UDF calls, "strongly" typed transformations with a wide schema), calling `show` can trigger executor failure.
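As a minimal sketch of that first case (the input path and column names are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("show-safety").getOrCreate()
import spark.implicits._

val events = spark.read.parquet("/data/events")  // hypothetical large dataset

// A wide transformation: groupBy forces a full shuffle and aggregation,
// so even show() on the result must run that shuffle on the executors.
val counts = events.groupBy($"userId").agg(count("*").as("n"))

counts.show()  // only ~20 rows reach the driver, but only after the shuffle runs
```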
Unlike `collect`, it will fetch only a small subset of the data (20 records by default). So, excluding `local` mode, it is unlikely it will ever trigger a driver failure.
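Continuing the sketch above, the driver-side difference looks like this:

```scala
// show() bounds what reaches the driver; collect() does not.
counts.show(50, truncate = false)  // at most 50 rows are brought to the driver

val all = counts.collect()         // materializes EVERY row on the driver;
                                   // this is the call that risks a driver OOM
```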
Even if none of the above is true, it is still possible that `show` will have to evaluate all records. This can happen if the pipeline contains highly restrictive selections (`filter`) which result in sparse leading partitions.
Overall, `show`, like other similarly restricted operations such as `take` (with a small `n`), is about as safe as you can get, but it cannot guarantee successful execution.
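For completeness, `take` behaves the same way in this respect (again continuing the sketch):

```scala
val sample = counts.take(5)  // Array[Row] with at most 5 elements on the driver
```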
Upvotes: 3