Reputation: 240
I have 2 questions:
Can we set fewer partitions in a call to coalesce than the number of HDFS blocks in the input? e.g. suppose I have a 1 GB file and the HDFS block size is 128 MB, can I do coalesce(1)?
As we know, input files on HDFS are physically split on the basis of the block size. Does Spark further split the data (physically) when we repartition or change the parallelism?
Upvotes: 0
Views: 205
Reputation: 1455
e.g. suppose I have a 1 GB file and the HDFS block size is 128 MB, can I do coalesce(1)?
Yes. coalesce(1) collapses the data down to a single partition, so the subsequent write produces a single output file, which you can write to an external file system (at least with EMRFS).
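As a minimal sketch in Spark's Scala API (the HDFS paths and app name are hypothetical, and the exact input partition count depends on your input format and configuration):

```scala
import org.apache.spark.sql.SparkSession

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-example")
      .getOrCreate()

    // A ~1 GB file with 128 MB HDFS blocks is typically read as ~8 input partitions.
    val df = spark.read.text("hdfs:///data/input/1gb-file.txt")
    println(s"Input partitions: ${df.rdd.getNumPartitions}")

    // coalesce(1) merges those partitions into one without a full shuffle,
    // so the write below produces a single part file.
    df.coalesce(1)
      .write
      .mode("overwrite")
      .text("hdfs:///data/output/single-file")

    spark.stop()
  }
}
```

Note that coalesce(1) funnels all the data through a single task, so for large inputs the final write can become slow or memory-heavy; the trade-off is that coalesce avoids the full shuffle that repartition would trigger.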
Does Spark further split the data (physically) when we repartition or change the parallelism?
repartition slices the data into partitions independently of the partitioning of the original input files.
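To illustrate (same hypothetical paths as above): repartition reshuffles the rows across executors into exactly the requested number of partitions during the job. The original file on HDFS is not physically re-split; the new layout only reaches disk if you write the result back out.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-example")
      .getOrCreate()

    // The initial partitioning follows the HDFS block layout (~8 partitions
    // for a 1 GB file with 128 MB blocks), but that only affects the read.
    val df = spark.read.text("hdfs:///data/input/1gb-file.txt")
    println(s"Partitions from HDFS blocks: ${df.rdd.getNumPartitions}")

    // repartition(20) triggers a full shuffle and redistributes the rows into
    // 20 partitions, regardless of the original block boundaries. The source
    // file on HDFS is untouched; only a subsequent write persists this layout.
    val reshaped = df.repartition(20)
    println(s"Partitions after repartition: ${reshaped.rdd.getNumPartitions}")

    spark.stop()
  }
}
```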
Upvotes: 1