Abhinav Kumar

Reputation: 240

How does Spark repartitioning work with respect to the input file partitioning?

I have 2 questions:

  1. Can the number of partitions passed to coalesce be smaller than the number of HDFS blocks in the file? E.g., suppose I have a 1 GB file and the HDFS block size is 128 MB; can I do coalesce(1)?

  2. As we know, input files on HDFS are physically split on the basis of block size. Does Spark further split the data (physically) when we repartition or change the parallelism?

Upvotes: 0

Views: 205

Answers (1)

Chris

Reputation: 1455

E.g., suppose I have a 1 GB file and the HDFS block size is 128 MB. Can I do coalesce(1)?

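Yes. coalesce(1) merges all of the data into a single partition, so the write produces a single output file, and you can write that file to an external file system (this works at least with EMRFS). The number of partitions does not have to match the number of HDFS blocks; the cost is that a single task then handles the full 1 GB.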
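To make the numbers in the example concrete, here is a rough sketch of the arithmetic in plain Python (it does not call Spark). The helper names hdfs_block_count and partitions_after_coalesce are made up for illustration, and the second one assumes the usual coalesce behaviour without a shuffle, where asking for more partitions than you currently have leaves the count unchanged.

```python
MB = 1024 * 1024
GB = 1024 * MB


def hdfs_block_count(file_size_bytes, block_size_bytes):
    """Number of blocks a file of this size occupies (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def partitions_after_coalesce(current_partitions, requested_partitions):
    """Toy model of coalesce without a shuffle: it can only reduce the count."""
    return min(current_partitions, requested_partitions)


# A 1 GB file with 128 MB blocks is stored as 8 blocks, so reading it
# typically starts with roughly 8 partitions (one per block/split).
blocks = hdfs_block_count(1 * GB, 128 * MB)
print(blocks)                                 # 8

# coalesce(1) is allowed: 8 partitions are merged into 1.
print(partitions_after_coalesce(blocks, 1))   # 1

# Asking coalesce for more partitions than exist does not increase them.
print(partitions_after_coalesce(blocks, 32))  # 8
```

So coalesce(1) on the 1 GB file is valid; it just means one task ends up processing what was originally about eight blocks' worth of data.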

Does Spark further split the data (physically) when we repartition or change parallelism?

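repartition redistributes the rows across the requested number of partitions with a shuffle, independently of how the original input files were split into blocks. It changes how Spark distributes the data across tasks; the blocks of the input files on HDFS are not modified.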

Upvotes: 1
