Daniyal

Reputation: 905

Spark Memory Error at Parallelization Step

We are using the most recent Spark build. We have as input a very large list of tuples (800 million). We run our PySpark program using Docker containers with one master and multiple worker nodes. A driver is used to run the program and connect to the master.

When we run the program, at the line sc.parallelize(tuplelist) it either quits with a Java heap space error or quits without any error at all. We use neither a Hadoop HDFS layer nor YARN.
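For reference, the call pattern in question looks roughly like the sketch below. It is a simplified stand-in, not our actual program: the master URL, the driver memory value, and the numSlices argument are placeholders; they are simply the knobs most directly involved at this line, since parallelize() ships the entire driver-side list out to the executors.

    # Simplified sketch of the failing pattern; master URL, driver memory and
    # numSlices below are placeholders, not our real settings.
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setMaster("spark://spark-master:7077")    # standalone master (no YARN/HDFS)
        .setAppName("parallelize-large-tuplelist")
        # The whole tuple list lives on the driver before it is distributed,
        # so driver memory is one of the first knobs to check for a heap error
        # at this line. It must be set before the driver JVM starts (here, or
        # via spark-submit --driver-memory).
        .set("spark.driver.memory", "8g")
    )
    sc = SparkContext(conf=conf)

    tuplelist = [(i, i * 2) for i in range(1_000_000)]  # stand-in for the 800M tuples

    # numSlices controls how many partitions the list is split into.
    rdd = sc.parallelize(tuplelist, numSlices=1000)
    print(rdd.count())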

So far we have considered the possible factors mentioned in these SO postings:

At this point we have the following questions:

Upvotes: 1

Views: 556

Answers (1)

data_addict

Reputation: 894

How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?

Ans: There are multiple factors that determine the number of partitions.

1) In many cases, using 3-4x as many partitions as you have total cores works well, provided each partition takes more than a few seconds to process (see the sketch after this list).

2) Partitions shouldn't be too small or too large; roughly 128 MB to 256 MB per partition is usually a good target.
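A minimal sketch of that rule of thumb, assuming (purely for illustration) that the cluster-wide core count can be read from spark.cores.max with a fallback of 16:

    # Hedged sketch of the 3-4x-cores rule of thumb; the fallback core count
    # and the stand-in data are illustrative, not values from the question.
    from pyspark import SparkContext

    sc = SparkContext(appName="partition-rule-of-thumb")

    # Total cores the application may use across the cluster (standalone mode);
    # assume 16 if the setting is absent.
    total_cores = int(sc.getConf().get("spark.cores.max", "16"))
    num_partitions = total_cores * 4                   # 3-4x total cores

    tuplelist = [(i, str(i)) for i in range(100_000)]  # stand-in data
    rdd = sc.parallelize(tuplelist, numSlices=num_partitions)

    print(rdd.getNumPartitions())                      # should equal num_partitions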

Do you know of any (common?) mistakes which may lead to the observed behavior?

Ans: Can you check the executor memory and the disk space available for a job of this size?

If you can share more details about the job, e.g. number of cores, executor memory, number of executors, and available disk, it will be easier to point out the issue.
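On a standalone cluster (no YARN), those settings typically live in the SparkConf (or the equivalent spark-submit flags); the sketch below shows where each one goes, with every size and count being a placeholder rather than a recommendation for this particular job:

    # Illustrative resource settings for a standalone (no YARN) cluster; all
    # values are placeholders, not tuning advice for the job in the question.
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setMaster("spark://spark-master:7077")
        .setAppName("resource-settings-example")
        .set("spark.executor.memory", "4g")     # heap per executor JVM
        .set("spark.executor.cores", "2")       # cores per executor
        .set("spark.cores.max", "8")            # total cores the app may take (standalone)
        .set("spark.local.dir", "/tmp/spark")   # disk location for shuffle/spill data
    )
    sc = SparkContext(conf=conf)

    # Print the effective values so they can be added to the question.
    for key in ("spark.executor.memory", "spark.executor.cores",
                "spark.cores.max", "spark.local.dir"):
        print(key, "=", sc.getConf().get(key, "<not set>"))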

Upvotes: 1
