spowers

Reputation: 687

Memory difference between pyspark and spark?

I have been trying to get a PySpark job to work that creates an RDD from a bunch of binary files, then uses a flatMap operation to process the binary data into rows. This has led to a series of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest thing possible working: just counting the number of files in the RDD.

This also fails with an OOM error. So I opened both spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn. The spark-shell version works, while the PySpark version shows the same OOM error.

Is there that much overhead to running PySpark? Or is this a problem with binaryFiles being new? I am using Spark version 2.2.0.2.6.4.0-91.
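For reference, the minimal test looks roughly like this (the HDFS path is only a placeholder):

    # launched with:  pyspark --master yarn   (otherwise default settings)
    rdd = sc.binaryFiles("hdfs:///data/binfiles")  # placeholder directory of binary files
    rdd.count()                                    # fails with an OOM error in PySpark

    # the equivalent in spark-shell completes fine:
    #   val rdd = sc.binaryFiles("hdfs:///data/binfiles")
    #   rdd.count()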

Upvotes: 0

Views: 648

Answers (2)

Gangadhar Kadam

Reputation: 546

Since you are reading multiple binary files with binaryFiles(), note that starting with Spark 2.1 the minPartitions argument of binaryFiles() is ignored.

1. Try to repartition the input after reading it (see the sketch after this list):

    rdd = sc.binaryFiles(<path to the binary files>, minPartitions=<n>).repartition(<n>)

2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the following configs:

spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
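A minimal sketch of both suggestions in PySpark; the path, the partition count, and the exact byte values are placeholders to adapt to your data:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB instead of the 128 MB default
            .set("spark.files.openCostInBytes", str(4 * 1024 * 1024))     # 4 MB default, set explicitly
            .set("spark.default.parallelism", "100"))                     # placeholder value
    sc = SparkContext(conf=conf)

    # minPartitions is ignored since Spark 2.1, so repartition explicitly after reading
    rdd = sc.binaryFiles("hdfs:///data/binfiles").repartition(100)  # placeholder path and count
    rdd.count()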

Upvotes: 1

user10210302

Reputation: 21

The difference:

  • Scala will load records as PortableDataStream; this means the process is lazy, and unless you call toArray on the values, it won't load the data at all.
  • Python will call the Java backend, but it loads the data as a byte array. This part is eager-ish and therefore might fail on both sides.
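To make the second point concrete, here is a small PySpark illustration (placeholder path): by the time Python sees a record, the file content is already a fully materialized byte string, whereas in Scala the value is a PortableDataStream that reads nothing until toArray is called.

    rdd = sc.binaryFiles("hdfs:///data/binfiles")   # placeholder path
    path, content = rdd.first()
    print(path, len(content))                       # content is plain bytes, loaded eagerly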

Additionally, PySpark will use at least twice as much memory, since it keeps both a Java and a Python copy of the data.

Finally, binaryFiles (same as wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In a case like this it is better to implement a format-specific Hadoop InputFormat.
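Purely as an illustration of that last suggestion, this is roughly how a custom InputFormat would be wired in from PySpark. com.example.MyBinaryInputFormat is a hypothetical class you would implement in Java/Scala and ship on the classpath (e.g. with --jars), and the key/value classes are just one plausible choice:

    records = sc.newAPIHadoopFile(
        "hdfs:///data/binfiles",                             # placeholder path
        inputFormatClass="com.example.MyBinaryInputFormat",  # hypothetical custom InputFormat
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.BytesWritable",
    )
    records.count()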

Upvotes: 2
