theseadroid

Reputation: 471

Is SparkContext.newAPIHadoopFile API reading in and processing single file in parallel?

I need to use Spark to read a huge uncompressed text file (>20GB) into an RDD. Each record in the file spans multiple lines (<20 lines per record), so I can't use sc.textFile. I'm considering using SparkContext.newAPIHadoopFile with a custom delimiter. However, since the file is fairly big, I'm curious whether the reading and parsing will be distributed across multiple Spark executors, or happen on only one node.

The file content looks as follows:

record A
content for record A
content for record A
content for record A
record B
content for record B
content for record B
content for record B
...
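
Roughly the call I have in mind (in Scala); the file path and the "\nrecord " delimiter below are just placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Use "\nrecord " as the record separator instead of "\n",
// so each multi-line record arrives as a single value.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\nrecord ")

val records = sc.newAPIHadoopFile(
  "/path/to/huge_file.txt",   // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)
  .map { case (_, value) => value.toString }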

Upvotes: 0

Views: 640

Answers (1)

Gelerion

Reputation: 1704

It depends on your input format and, mostly, on the compression codec. E.g. gzip is not splittable, but bzip2 is (Snappy is splittable only inside container formats such as SequenceFile, Parquet or ORC). An uncompressed text file like yours is splittable.

If the file is splittable, the Hadoop API will take care of it according to its split-size configuration (this is roughly what FileInputFormat.getSplits() does):

minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
maxSize = getMaxSplitSize(job);

// then, for each input file:
blockSize = file.getBlockSize();
splitSize = computeSplitSize(blockSize, minSize, maxSize);

Each split then becomes an RDD partition, and the partitions are read and processed in parallel across the cluster.
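
If you want to steer how many partitions you end up with, you can bound the split size via the standard Hadoop settings on the Configuration you pass to newAPIHadoopFile; a rough sketch in Scala (the 128 MB cap is just an example):

import org.apache.hadoop.conf.Configuration

// Cap each split at ~128 MB, so a >20 GB file yields roughly 160+ splits
// (and therefore 160+ RDD partitions). The value is only an example.
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.split.maxsize", (128L * 1024 * 1024).toString)

// After building the RDD with this conf:
// rdd.getNumPartitions   // one partition per computed split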

Upvotes: 1
