Reputation: 11
I am running a PySpark application on a 32-core, 64 GB server using the spark-submit command.
Steps in the application:
df1 = load a 500-million-row dataset from a CSV file (field1, field2, field3, field4).
df2 = load 500 million entries from MongoDB using the Spark Mongo connector (field1, field2, field3).
Left join operation (the step throwing java.lang.OutOfMemoryError: Java heap space):
df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer").select("*")
Update the MongoDB collection using df_output in append mode. (A minimal sketch of these steps is shown below.)
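For clarity, here is a minimal sketch of the pipeline; the CSV path, Mongo URI, and collection names are placeholders, not the real values:

# Minimal sketch of the steps above; paths, URI, and collection names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv_mongo_left_join")
    .getOrCreate()
)

# df1: 500M rows from CSV (field1, field2, field3, field4)
df1 = spark.read.csv("/data/input.csv", header=True, inferSchema=True)

# df2: 500M rows from MongoDB via the mongo-spark connector
df2 = (
    spark.read.format("com.mongodb.spark.sql.DefaultSource")
    .option("uri", "mongodb://mongo_host/db.source_collection")
    .load()
    .select("field1", "field2", "field3")
)

# Left outer join on the three shared key columns (the failing step)
df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer")

# Write the result back to MongoDB in append mode
(
    df_output.write.format("com.mongodb.spark.sql.DefaultSource")
    .option("uri", "mongodb://mongo_host/db.target_collection")
    .mode("append")
    .save()
)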
Configuration in conf/spark-env.sh:
All other parameters are left at their default values.
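For illustration only (these are not my actual values), a standalone setup on a 32-core / 64 GB machine might use something like:

# conf/spark-env.sh -- illustrative placeholder values, not the real configuration
SPARK_WORKER_CORES=32      # cores the worker can hand out to executors
SPARK_WORKER_MEMORY=48g    # memory the worker can allocate, leaving headroom for the OS and MongoDB
SPARK_DAEMON_MEMORY=1g     # memory for the master/worker daemons themselves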
Setting up the master and one worker with the following commands:
sbin/start-master.sh
sbin/start-slave.sh spark://master_ip:7077
Running the script with the command:
nohup bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 --master spark://master_ip:7077 ../test_scripts/test1.py > /logs/logs.out &
What is the best approach to tuning the configuration parameters for optimal performance on this dataset, and how should these parameters be configured for an arbitrary dataset?
Upvotes: 1
Views: 309
Reputation: 485
A few things to consider if you run into memory issues. Make sure you set the parameters below accordingly.
spark.executor.memory=yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)
spark.yarn.executor.memoryOverhead=15-20% of spark.executor.memory
Try increasing the spark.sql.shuffle.partitions parameter (default 200) to 2000 or more. Hope this helps.
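For example, passing these settings through spark-submit might look like the following (illustrative values for a single large executor, not a prescription; spark.yarn.executor.memoryOverhead only applies when running on YARN and is given in MB):

bin/spark-submit \
  --master spark://master_ip:7077 \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
  --conf spark.executor.memory=40g \
  --conf spark.yarn.executor.memoryOverhead=6144 \
  --conf spark.sql.shuffle.partitions=2000 \
  ../test_scripts/test1.py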
Upvotes: 0