Reputation: 687
I have been trying to get a PySpark job to work which creates an RDD from a bunch of binary files, and then uses a flatMap
operation to process the binary data into a bunch of rows. This has led to a series of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest thing possible working, which is just counting the number of files in the RDD.
This also fails with an OOM error. So I opened up both the spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn.
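In both shells the test boils down to counting a binaryFiles RDD; on the PySpark side it is essentially the following (the HDFS path is just a placeholder, not my actual data):

# count the files in a binaryFiles RDD (placeholder path)
rdd = sc.binaryFiles("hdfs:///path/to/binary/files")
print(rdd.count())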
The spark-shell
version works, while the PySpark version shows the same OOM error.
Is there that much overhead to running PySpark? Or is this a problem with binaryFiles
being new? I am using Spark version 2.2.0.2.6.4.0-91.
Upvotes: 0
Views: 648
Reputation: 546
Since you are reading multiple binary files with binaryFiles(), and since starting with Spark 2.1 the minPartitions argument of binaryFiles() is ignored, try the following:
1. Try to repartition the input files, for example:
rdd = sc.binaryFiles("/path/to/binary/files").repartition(num_partitions)  # choose num_partitions to suit your data; minPartitions is ignored here
2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the configs below (a combined sketch follows the list):
spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
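A minimal sketch of both suggestions together, assuming a SparkSession built in client code; the path, the 64 MB target, and the parallelism of 200 are placeholder assumptions to adjust for your data and cluster:

from pyspark.sql import SparkSession

# Placeholder values below are assumptions, not recommendations.
spark = (SparkSession.builder
         .appName("binary-files-count")
         .config("spark.files.maxPartitionBytes", 64 * 1024 * 1024)  # pack at most ~64 MB of files per partition
         .config("spark.files.openCostInBytes", 4 * 1024 * 1024)     # estimated cost of opening each file
         .config("spark.default.parallelism", 200)                   # tune to your cluster size
         .getOrCreate())
sc = spark.sparkContext

rdd = sc.binaryFiles("hdfs:///path/to/binary/files").repartition(200)
print(rdd.count())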
Upvotes: 1
Reputation: 21
The difference: in Scala, binaryFiles returns the values as PortableDataStream - this means the process is lazy, and unless you call toArray on the values, it won't load the data at all. Additionally, PySpark will use at least twice as much memory - one copy for Java and one for Python.
Finally, binaryFiles (same as wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In a case like this it is better to implement a format-specific Hadoop input format.
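To illustrate the PySpark side, a small sketch (the path is a placeholder): each record is a (path, bytes) pair holding an entire file, so even a cheap transformation forces every file's contents to be read on the JVM and shipped to the Python workers, which is where the extra memory pressure comes from.

# Each element is (file_path, full_file_contents_as_bytes).
rdd = sc.binaryFiles("hdfs:///path/to/binary/files")

# Only (path, length) pairs are collected, but the file bytes still have to be
# read and sent to Python so that len() can be computed there.
sizes = rdd.map(lambda kv: (kv[0], len(kv[1]))).collect()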
Upvotes: 2