Pythonist

Reputation: 85

Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory - PySpark error

I wrote a script in Python 2.7 that uses PySpark to convert CSV to Parquet, among other things. When I run the script on a small dataset it works well, but when I run it on a bigger dataset (250GB) it crashes with the following error: Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory. How can I solve this problem, and what is causing it? Thanks!

Part of the code. Imports:

import pyspark as ps
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType, TimestampType, LongType, FloatType
from collections import OrderedDict
from sys import argv

Using PySpark:

schema_table_name = "schema_" + str(get_table_name())
print(schema_table_name)
schema_file = OrderedDict()

schema_list = []
ddl_to_schema(data)
for i in schema_file:
    schema_list.append(StructField(i, schema_file[i]()))

schema = StructType(schema_list)
print(schema)

spark = ps.sql.SparkSession.builder.getOrCreate()
df = spark.read.option("delimiter", ",").format("csv").schema(schema).option("header", "false").load(argv[2])
df.write.parquet(argv[3])

# df.limit(1500).write.jdbc(url=url, table=get_table_name(), mode="append", properties=properties)
# df = spark.read.jdbc(url=url, table=get_table_name(), properties=properties)
pq = spark.read.parquet(argv[3])
pq.show()

Just to clarify: schema_table_name is meant to hold the names of all the tables (those in the DDL that match the CSV).

The function ddl_to_schema just takes a regular DDL and edits it into a DDL that Parquet can work with.
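For illustration, after ddl_to_schema(data) runs, schema_file ends up mapping column names to PySpark type constructors (that's why the loop above calls schema_file[i]()). The names and types below are made up, just to show the shape:

# hypothetical contents of schema_file; the real names and types come from the DDL
schema_file["id"] = IntegerType
schema_file["event_time"] = TimestampType
schema_file["amount"] = DoubleType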

Upvotes: 4

Views: 6647

Answers (2)

Doron Yaacoby

Reputation: 9760

If you are running a local script and aren't using spark-submit directly, you can do this:

import os

# Must be set before the SparkSession (and its JVM) is created.
# The trailing "pyspark-shell" is needed when PySpark is launched from a plain Python script.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 2g pyspark-shell"
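In the script from the question, that would sit near the top, before the session is built; a rough sketch reusing the question's variable names:

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 2g pyspark-shell"

import pyspark as ps

# The driver memory setting is picked up here, when the JVM is launched.
spark = ps.sql.SparkSession.builder.getOrCreate()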

Upvotes: 1

Andre

Reputation: 391

It seems your driver is running out of memory.

By default the driver memory is set to 1GB. Since your program used 95% of it, the application ran out of memory.
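One way to confirm what the driver actually got (assuming a running session named spark, as in the question's code):

# Prints the configured driver memory, falling back to the 1g default if it was never set.
print(spark.sparkContext.getConf().get("spark.driver.memory", "1g"))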

You can try changing it until you reach the "sweet spot" for your needs. Below I'm setting it to 2GB:

pyspark --driver-memory 2g

You can play with the executor memory too, although it doesn't seem to be the problem here (the default value for the executor is 1GB).

pyspark --driver-memory 2g --executor-memory 8g
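If you'd rather set this from the script itself, executor memory can also go through the session builder; note that in client mode the driver memory must be set before the JVM starts (on the command line or via PYSPARK_SUBMIT_ARGS), so setting spark.driver.memory here would have no effect. A sketch:

import pyspark as ps

# Executor memory is applied when executors are launched, so the builder config works for it.
spark = (ps.sql.SparkSession.builder
         .config("spark.executor.memory", "8g")
         .getOrCreate())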

The theory is that Spark actions can pull data back to the driver, causing it to run out of memory if it isn't sized properly. I can't tell for sure in your case, but it seems the write is what is causing this.

You can take a look at the theory here (read about the driver program and then check the actions):

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions
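Since the question runs a standalone script, the equivalent when launching through spark-submit would look something like this (the script name and argument order are placeholders based on the question's use of argv):

spark-submit --driver-memory 2g --executor-memory 8g convert_csv_to_parquet.py <ddl_file> <input_csv> <output_parquet>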

Upvotes: 5
