Reputation: 75
I am working on logic to join 13 files and write the final file to blob storage. The input files are in CSV format and the output is written as parquet. Of the 13 files, file1 is 950 MB, file2 is 50 MB, file3 is 150 MB, file4 is 620 MB, file5 is 235 MB, files 6 and 7 are under 1 MB, file8 is 70 MB, and the rest are all in KBs. No other computation is performed, yet I see 2 GB of memory spill. How can I avoid this spill?
I did some investigation and learned about AQE, which is enabled by default in recent Spark versions. AQE handles post-shuffle partitioning and coalescing on its own, so there are no longer 200 default partitions after a shuffle. In my case the final written output has 13 parquet partitions of around 60 MB each.
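The ~60 MB partitions above line up with AQE's advisory partition size, which defaults to 64 MB (`spark.sql.adaptive.advisoryPartitionSizeInBytes`). A minimal sketch of that coalescing arithmetic, using a hypothetical helper (not Spark's actual planner code):

```python
import math

# Hypothetical helper: estimate how many post-shuffle partitions AQE would
# coalesce to, so each partition stays near the advisory target size.
# 64 MB is Spark's default advisory partition size, which matches the
# ~60 MB partitions observed in the question.
def estimate_partitions(total_bytes: int, target_bytes: int = 64 * 1024**2) -> int:
    return max(1, math.ceil(total_bytes / target_bytes))

# The largest input (file1, ~950 MB) coalesces to roughly 15 partitions
# at the 64 MB advisory size.
print(estimate_partitions(950 * 1024**2))  # 15
```

This only models partition counts; the actual sizes AQE sees are post-shuffle, compressed map-output sizes, which can differ from the on-disk CSV sizes.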
Secondly, I focused on the join types with AQE enabled. AQE broadcasts all the joins except those involving file1, file4 and file5; those three are performed as sort-merge joins and the rest as broadcast hash joins. So AQE does its job in choosing the right join strategy, yet the memory spill remains.
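The split described above can be modeled as a simple threshold rule: sides whose runtime size falls below the broadcast threshold get a broadcast hash join, larger ones fall back to sort-merge. This is a simplified model, not Spark's planner, and the 200 MB threshold is an assumption chosen only to reproduce the behaviour described (file3 at 150 MB broadcasts; file1/file4/file5 at 950/620/235 MB go sort-merge):

```python
MB = 1024**2

# Simplified model of AQE's join-strategy choice: broadcast when the build
# side fits under the threshold, otherwise sort-merge. The 200 MB value is
# an assumption for illustration, not a Spark default (the real default
# autoBroadcastJoinThreshold is 10 MB, and AQE uses runtime sizes).
def choose_join(build_side_bytes: int, threshold: int = 200 * MB) -> str:
    return "broadcast-hash" if build_side_bytes <= threshold else "sort-merge"

for name, size in {"file1": 950, "file3": 150, "file4": 620, "file5": 235}.items():
    print(name, choose_join(size * MB))
```

Only the sort-merge joins shuffle and sort their inputs, which is where the spill reported here can occur; broadcast hash joins avoid the shuffle entirely.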
Next I read up on how to reduce memory spill. I understood that spill happens when a partition is larger than what Spark's execution memory can hold, so data is spilled to disk. This happens during the shuffle stage of a join, and one suggestion is to avoid the shuffle by bucketing the data. So I bucketed file1, file4 and file5 on the join key (which is the same for all three files), distributing each into 10 buckets. The DAG showed there is no shuffle this time, but the memory spill still exists at the same size.
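A back-of-envelope sketch of Spark's unified memory model shows why a sort-merge join can still spill even when files look small. All figures below are assumptions for illustration: a ~10 GB executor heap on the 14 GB node, the default `spark.memory.fraction = 0.6`, and the default `spark.memory.storageFraction = 0.5`:

```python
MB = 1024**2
GB = 1024**3

# Rough sketch of Spark's unified memory model: heap minus ~300 MB reserved,
# times memory.fraction, split between execution and storage, shared across
# concurrently running tasks. Execution can borrow from storage at runtime,
# so this computes the guaranteed floor per task, not a hard cap.
def execution_memory_per_task(heap_bytes: int, cores: int,
                              memory_fraction: float = 0.6,
                              storage_fraction: float = 0.5) -> float:
    reserved = 300 * MB
    unified = (heap_bytes - reserved) * memory_fraction
    execution = unified * (1 - storage_fraction)
    return execution / cores

# With 4 concurrent tasks on an assumed ~10 GB heap, each task is guaranteed
# only ~745 MB of execution memory, so sorting a large join side spills.
print(execution_memory_per_task(10 * GB, 4) / MB)
```

Under these assumptions, a task sorting a partition of file1 for the sort-merge join can exceed its execution-memory share, which would explain spill that persists even after bucketing removes the shuffle (the sort within each bucket still happens).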
I have been unable to reduce this memory spill after many tries with different bucket counts. I understand that my files are not that big, but this is one month of data, and my downstream applications will process much bigger files as I add more monthly data.
I am using Standard_D3_v2 workers (14 GB RAM, 4 cores) with an autoscaling range of 2-8 workers. This join over the 13 files uses 6 workers, but only 2 workers are active until the end of the join phase; 4 more workers are used for writing the output file (observed from the event timeline).
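The underutilization described above (2 of 6 workers active during the join) suggests too few partitions on the large join sides. A common rule of thumb, which is an assumption here rather than a Spark default, is 2-3 partitions per available core so every worker stays busy:

```python
# Rough sizing sketch: 6 Standard_D3_v2 workers at 4 cores each can run
# 24 tasks in parallel. With only 10 buckets, at most 10 tasks run at once,
# leaving most cores idle. Targeting 2-3 partitions per core (a rule of
# thumb, not a Spark default) keeps all workers busy through the join.
def suggested_partitions(workers: int, cores_per_worker: int, factor: int = 2) -> int:
    return workers * cores_per_worker * factor

print(suggested_partitions(6, 4))  # 48
```

Smaller partitions also mean each task sorts less data at once, which directly reduces the chance that a single task exceeds its execution-memory share and spills.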
How can I avoid this memory spill?
Upvotes: 1
Views: 4408
Reputation: 1152
To avoid spilling, you could do either (or both) of the following:
Upvotes: 2