theseadroid

Reputation: 471

Is SparkContext.newAPIHadoopFile API reading in and processing single file in parallel?

I need to use Spark to read a huge uncompressed text file (>20GB) into an RDD. Each record in the file spans multiple lines (<20 lines per record), so I can't use sc.textFile. I'm considering using SparkContext.newAPIHadoopFile with a custom delimiter. However, since the file is fairly big, I'm curious whether the reading and parsing will be distributed across multiple Spark executors, or happen on only one node.

The file content looks as follows:

record A
content for record A
content for record A
content for record A
record B
content for record B
content for record B
content for record B
...
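
Roughly the call I have in mind (in Scala); the file path and the "\nrecord " delimiter below are just placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Use "\nrecord " as the record separator instead of "\n",
// so each multi-line record arrives as a single value.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\nrecord ")

val records = sc.newAPIHadoopFile(
  "/path/to/huge_file.txt",   // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)
  .map { case (_, value) => value.toString }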

Upvotes: 0

Views: 640

Answers (1)

Gelerion

Reputation: 1704

It depends on your input format and, mostly, on the compression codec. E.g. gzip is not splittable, but bzip2 is (Snappy is splittable only inside container formats such as SequenceFile, Parquet or ORC). An uncompressed text file like yours is splittable.

If the file is splittable, the Hadoop API will take care of it according to its split-size configuration (this is roughly what FileInputFormat.getSplits() does):

minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
maxSize = getMaxSplitSize(job);

// then, for each input file:
blockSize = file.getBlockSize();
splitSize = computeSplitSize(blockSize, minSize, maxSize);

Each split then becomes an RDD partition, and the partitions are read and processed in parallel across the cluster.
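
If you want to steer how many partitions you end up with, you can bound the split size via the standard Hadoop settings on the Configuration you pass to newAPIHadoopFile; a rough sketch in Scala (the 128 MB cap is just an example):

import org.apache.hadoop.conf.Configuration

// Cap each split at ~128 MB, so a >20 GB file yields roughly 160+ splits
// (and therefore 160+ RDD partitions). The value is only an example.
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.split.maxsize", (128L * 1024 * 1024).toString)

// After building the RDD with this conf:
// rdd.getNumPartitions   // one partition per computed split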

Upvotes: 1
