jarvis

Reputation: 11

Getting java.lang.OutOfMemoryError when submitting a PySpark application

I am running a PySpark application on a 32-core, 64 GB server using the spark-submit command.

Steps in the application:

  1. df1 = load a 500-million-row dataset from a CSV file (field1, field2, field3, field4).

  2. df2 = load 500 million entries from MongoDB (using the Spark Mongo connector) (field1, field2, field3).

  3. Left join operation (the step throwing java.lang.OutOfMemoryError: Java heap space; see the sketch after this list):

    df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer").select("*")

  4. Update the Mongo collection using df_output in append mode.
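
A minimal sketch of the pipeline above, assuming the mongo-spark-connector 2.0.x API; the file path, MongoDB URIs, and app name are placeholders, not the real values:

    from pyspark.sql import SparkSession

    # Placeholder URIs and path; the real job points at the actual deployment.
    spark = (SparkSession.builder
             .appName("csv-mongo-left-join")
             .config("spark.mongodb.input.uri", "mongodb://host:27017/db.collection")
             .config("spark.mongodb.output.uri", "mongodb://host:27017/db.collection")
             .getOrCreate())

    # Step 1: ~500M rows from CSV (field1, field2, field3, field4)
    df1 = spark.read.csv("/path/to/data.csv", header=True)

    # Step 2: ~500M rows from MongoDB via the mongo-spark connector
    df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

    # Step 3: the left join that throws java.lang.OutOfMemoryError: Java heap space
    df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer")

    # Step 4: append the result back to MongoDB
    df_output.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()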

Configuration in conf/spark-env.sh:

and the remaining parameters are left at their default values.

I set up the master and one worker, and run the script with the command below (master_ip stands for the actual master URL, e.g. spark://host:7077):

nohup bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 --master master_ip ../test_scripts/test1.py > /logs/logs.out &

What is the best approach for tuning the configuration parameters for optimal performance on this dataset, and how should these parameters be configured for an arbitrary dataset?

Upvotes: 1

Views: 309

Answers (1)

vikrame

Reputation: 485

A few things to consider if you run into memory issues. Make sure you set the parameters below accordingly.

spark.executor.memory = yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)

spark.yarn.executor.memoryOverhead = 15-20% of spark.executor.memory
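
For example, assuming yarn.nodemanager.resource.memory-mb = 57344 (56 GB of a 64 GB node given to YARN), yarn.nodemanager.resource.cpu-vcores = 32, and spark.executor.cores = 8 (all numbers illustrative):

    spark.executor.memory              = 57344 * (8 / 32) = 14336 MB (~14 GB)
    spark.yarn.executor.memoryOverhead = 15-20% of 14336 MB ≈ 2150-2870 MB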

Also try increasing the spark.sql.shuffle.partitions parameter (default 200) to greater than 2000; past 2000 partitions, Spark tracks shuffle output with the more compact HighlyCompressedMapStatus, which eases memory pressure on the driver. Hope this helps.
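
Putting the illustrative numbers together, these settings can be passed at submit time (the values and script name are placeholders to adapt, not a recommendation):

    spark-submit \
      --conf spark.executor.memory=14g \
      --conf spark.yarn.executor.memoryOverhead=2800 \
      --conf spark.sql.shuffle.partitions=2001 \
      your_script.py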

Upvotes: 0
