How does Spark repartitioning work with respect to the input file partitioning?

Question

I have 2 questions:

Can we set fewer partitions in a call to coalesce than the number of HDFS blocks in the file? For example, suppose I have a 1 GB file and the HDFS block size is 128 MB; can I call coalesce(1)?
As we know, input files on HDFS are physically split on the basis of block size. Does Spark further split the data (physically) when we repartition or change the parallelism?

Chris · Accepted Answer

Suppose I have a 1 GB file and the HDFS block size is 128 MB. Can I do coalesce(1)?

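Yes. The partition count you pass to coalesce is not tied to the number of HDFS blocks, so coalesce(1) merges all input partitions into one. You can then write the result as a single file to an external file system (at least with EMRFS). Keep in mind that the whole dataset then passes through a single task.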
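To make the arithmetic concrete, here is a rough pure-Python model of the example from the question. It is only a sketch, not Spark code: the helper names input_partitions and coalesce are made up for illustration, and it assumes Spark starts with one partition per HDFS block, which is the usual default but can vary with the input format and split settings.

```python
import math


def input_partitions(file_size_mb: int, block_size_mb: int) -> int:
    """Estimate the initial partition count, assuming one partition per HDFS block."""
    return math.ceil(file_size_mb / block_size_mb)


def coalesce(partitions: list, n: int) -> list:
    """Toy model of coalesce(n): merge whole partitions into at most n groups.

    Existing partitions are glued together as units; records are not
    redistributed one by one, and the partition count never increases.
    """
    n = min(n, len(partitions))
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map contiguous runs of source partitions onto each target partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


# 1 GB file with 128 MB blocks, as in the question.
blocks = input_partitions(1024, 128)
print(blocks)  # 8

# Two placeholder records per block, just to have something to merge.
parts = [[f"block{b}-rec{r}" for r in range(2)] for b in range(blocks)]
single = coalesce(parts, 1)
print(len(single), len(single[0]))  # 1 16
```

In this model the 8 block-aligned partitions collapse into 1 partition holding all 16 records, which is why a subsequent write produces a single output file.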

Does Spark further split the data (physically) when we repartition or change parallelism?

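repartition performs a full shuffle and slices the data into the requested number of partitions, independently of how the original input files were partitioned. The source files and their blocks on HDFS are not modified; the new layout only becomes physical when you write the result back out.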
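As a rough illustration of that independence, the following pure-Python sketch models repartition as a round-robin redistribution of records. It is a simplification under stated assumptions: the function below is a hypothetical stand-in, not Spark's implementation, and real Spark distributes records across executors through a shuffle rather than in a single loop.

```python
from itertools import chain


def repartition(partitions: list, n: int) -> list:
    """Toy model of repartition(n): deal every record out round-robin.

    The source partition of a record plays no role in where it lands,
    so n can be smaller or larger than the input partition count.
    """
    out = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(record)
    return out


# 8 block-aligned input partitions with 3 placeholder records each.
parts = [[(block, rec) for rec in range(3)] for block in range(8)]

fewer = repartition(parts, 4)
print([len(p) for p in fewer])  # [6, 6, 6, 6]

more = repartition(parts, 12)
print(len(more))  # 12
```

Each output partition in the model ends up with records from several source blocks, and the target count can exceed the original number of blocks, which coalesce without a shuffle cannot do.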
