Reputation: 119
We have set up a dedicated cluster for our application on AWS.
This is the configuration of the core nodes (we have 2 core nodes):
m5.xlarge
4 vCore, 16 GiB memory, EBS only storage
EBS Storage: 64 GiB
Current dataset -
We are trying to run a Spark job that involves many joins and works with 80 million records, each record with 60+ fields.
Issue we are facing -
When we try to save the final DataFrame as an Athena table, it takes more than 1 hour and times out.
As we are the only ones using the cluster, what should our configuration be to ensure that we use all the cluster resources optimally?
Current configuration
Executor Memory : 2G
Dynamic Allocation Enabled : true
Number of Executor Cores : 1
Number of Executors : 8
spark.dynamicAllocation.executorIdleTimeout : 3600
spark.sql.broadcastTimeout : 36000
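For reference, this is roughly how these values would look when building the session in PySpark (only an illustration of the same settings; the job itself may pass them as spark-submit flags instead):

```python
from pyspark.sql import SparkSession

# The current settings above, expressed as session config
# (assumed mechanism; spark-submit flags would be equivalent).
spark = (
    SparkSession.builder
    .appName("join-heavy-job")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "1")
    .config("spark.executor.instances", "8")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "3600")
    .config("spark.sql.broadcastTimeout", "36000")
    .getOrCreate()
)
```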
Upvotes: 0
Views: 174
Reputation: 1410
Looking at your config, some observations:
You are using
m5.xlarge
which has 4 vCores and 16 GiB of memory.
Executor config
Number of Executor Cores : 1
Executor Memory : 2G
So at most 4 executors can spin up on a node, and the memory required by those 4 executors is only 8 GiB of the 16 GiB available. So in the end you are not utilizing all the resources.
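For example, a sizing along these lines would use more of each node. This is only a sketch: the exact figures depend on how much memory YARN actually exposes on an EMR m5.xlarge node, so treat the numbers as starting points rather than a definitive recommendation.

```python
from pyspark.sql import SparkSession

# Sketch of a fuller-utilization layout for 2 x m5.xlarge core nodes
# (4 vCores / 16 GiB each). Numbers are illustrative; check the memory
# YARN actually exposes per node and adjust accordingly.
spark = (
    SparkSession.builder
    .appName("join-heavy-job")
    # 2 executors per node x 2 cores each = 4 executors using all 8 vCores
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    # ~5g heap + 1g overhead per executor keeps two executors inside one
    # node's YARN memory while using far more of the 16 GiB than 2g did
    .config("spark.executor.memory", "5g")
    .config("spark.executor.memoryOverhead", "1g")
    # a single dedicated job can run with a fixed executor count;
    # leave dynamic allocation on if you prefer, with these as the initial size
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)
```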
Also, as @Shadowtrooper said, save the data partitioned (if possible in Parquet format) if you can; it will also save cost when you query it in Athena.
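A minimal sketch of such a partitioned Parquet write (the S3 path and the partition column event_date are placeholders, not from your job):

```python
# Write the final DataFrame partitioned by a column your Athena queries filter on.
# "event_date" and the S3 path below are placeholders for your own values.
(final_df.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://your-bucket/warehouse/final_table/"))
```

Then point the Athena/Glue table at that location (for example via a Glue crawler or a CREATE EXTERNAL TABLE ... STORED AS PARQUET statement) and load the partitions so Athena can see them.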
Upvotes: 2