Reputation: 471
I need to use Spark to read a huge uncompressed text file (>20 GB) into an RDD. Each record in the file spans multiple lines (fewer than 20 lines per record), so I can't use sc.textFile. I'm considering SparkContext.newAPIHadoopFile with a custom delimiter. However, since the file is fairly big, I'm curious whether the reading and parsing will be distributed across multiple Spark executors, or will happen on only one node?
The file content looks as follows:
record A
content for record A
content for record A
content for record A
record B
content for record B
content for record B
content for record B
...
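To make the approach concrete, here is a rough sketch of what I have in mind (Scala). The path, the delimiter string, and the assumption that every record starts with a line beginning "record " are mine and untested:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Assumption: every record begins with a line starting "record ", so use that as the delimiter.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\nrecord ")

val records = sc.newAPIHadoopFile(
    "/path/to/huge_file.txt",          // placeholder path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }  // drop byte offsets; each element is one multi-line record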
Upvotes: 0
Views: 640
Reputation: 1704
It depends on your input format and, mostly, on the compression codec. For example, gzip is not splittable, but Snappy is.
If it is splittable, the Hadoop API will take care of it according to its split-size configuration:
minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
maxSize = getMaxSplitSize(job);

// for each file:
  blockSize = file.getBlockSize();
  splitSize = computeSplitSize(blockSize, minSize, maxSize);
Then each split will become a partition and will be distributed across the cluster.
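As a rough illustration (assuming an existing SparkContext sc and the uncompressed file from the question; the path and the 64 MB cap are example values, not recommendations), the maximum split size can be lowered to get more, smaller partitions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// Cap splits at 64 MB, so a ~20 GB file produces roughly 320 splits.
conf.set("mapreduce.input.fileinputformat.split.maxsize", (64L * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile("/path/to/huge_file.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)

println(rdd.getNumPartitions)  // one partition per input split, spread across the executors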
Upvotes: 1