Reputation: 2108
In Apache Spark, I know that performing an action that collects the result on the driver, for example calling `collect()` on the data, is an unsafe operation that can lead to an Out Of Memory error if the collected result is larger than what the driver can hold in memory.

Is the `show()` function, which is applied to DataFrames, a function that can lead to an OOM for the same reason, or can I safely use `show()` (for example, for debugging)?
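For example, this is the kind of unsafe pattern I mean, written as a spark-shell sketch with `df` standing in for some large DataFrame:

```scala
// The pattern I am worried about: collect() pulls the entire result
// into the driver JVM, so a large df can OOM the driver.
val everything = df.collect()   // Array[Row] holding the whole dataset

// versus the call I would like to use for debugging:
df.show()                       // prints only the first 20 rows
```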
Upvotes: 0
Views: 1320
Reputation: 4631
`show` is as safe as the execution plan of the evaluated `Dataset`. If the `Dataset` contains wide transformations (non-broadcasted joins, aggregations, repartitions, applications of window functions) or resource-hungry narrow transformations (expensive UDF calls, "strongly" typed transformations with a wide schema), calling `show` can trigger executor failure.
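As a minimal sketch of that first case (the input path and column names are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("show-safety").getOrCreate()
import spark.implicits._

val events = spark.read.parquet("/data/events")  // hypothetical large dataset

// A wide transformation: groupBy forces a full shuffle and aggregation,
// so even show() on the result must run that shuffle on the executors.
val counts = events.groupBy($"userId").agg(count("*").as("n"))

counts.show()  // only ~20 rows reach the driver, but only after the shuffle runs
```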
Unlike `collect`, it will fetch only a small subset of the data (20 records by default). So, excluding `local` mode, it is unlikely it will ever trigger a driver failure.
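Continuing the sketch above, the driver-side difference looks like this:

```scala
// show() bounds what reaches the driver; collect() does not.
counts.show(50, truncate = false)  // at most 50 rows are brought to the driver

val all = counts.collect()         // materializes EVERY row on the driver;
                                   // this is the call that risks a driver OOM
```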
Even if none of the above is true, it is still possible that `show` will have to evaluate all records. This can happen if the pipeline contains highly restrictive selections (`filter`) which result in sparse leading partitions.
Overall, `show`, like other similarly restricted operations such as `take` (with a small `n`), is about as safe as you can get, but it cannot guarantee successful execution.
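For completeness, `take` behaves the same way in this respect (again continuing the sketch):

```scala
val sample = counts.take(5)  // Array[Row] with at most 5 elements on the driver
```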
Upvotes: 3