Reputation: 1618
I am executing this simple code on Databricks:
df = spark.read.table(table_name).sample(fraction=0.1)
my_df = df.collect()
I'm accessing an external managed Delta table in my Unity Catalog; the table has the following properties:
Size: 3.1 GiB, 13 files
Columns: 13
The cluster I am using is quite big, an m4.10xlarge with 160 GB of memory, and I'm also heavily downsampling the data.
The code executes quite quickly, but then it gets stuck forever here:
And the cluster is doing absolutely nothing in the meanwhile. I looked at the logs, and the only thing that caught my attention is:
2024-09-20T11:10:57.351+0000: [GC (Allocation Failure) [PSYoungGen: 36727934K->1398166K(35979264K)] 37066227K->1736467K(119453184K), 13.8056805 secs] [Times: user=68.37 sys=1.95, real=13.80 secs]
That line is spammed until eventually the cluster simply dies, I think because of some timeout set by Databricks.
Any idea where to start debugging this? The table has already been optimized, as at first I thought it could have been a partitioning problem.
Upvotes: 0
Views: 182
Reputation: 36
What's the driver size? The collect() method brings all the data to the driver, and if the amount of data being fetched is close to the driver's available memory, it will result in heavy garbage collection.
If display(df) works without any problem, the driver size is the likely culprit.
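As a minimal sketch of how I'd narrow it down (assuming the same table_name and the default spark session on the cluster): first check what the driver was actually given, confirm the distributed read is fine, and only then collect a bounded number of rows instead of the whole 10% sample.

# Check what the driver JVM was allocated; on Databricks this is often much less
# than the total cluster memory (the value may not be explicitly set on some runtimes).
print(spark.sparkContext.getConf().get("spark.driver.memory", "not explicitly set"))

df = spark.read.table(table_name).sample(fraction=0.1)

# Runs entirely on the executors; if this completes quickly, reading the table is fine.
print(df.count())

# Collect only a bounded preview instead of the whole sampled result.
preview = df.limit(1000).collect()

If limit(1000).collect() returns quickly while the full collect() hangs with the same GC spam, the driver is the bottleneck: either pick a larger driver node (or raise spark.driver.memory), or keep the data distributed and write it out rather than collecting it.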
Upvotes: 0