vkt

Reputation: 1459

Split a large file into small files and save them to different paths using Spark

How to split a large file/RDD/DataFrame into smaller files and save them to different paths?

For example: given a text file containing usernames (a single column), split it into N files and write those N files to different directories.

val x = 20
val namesRDD = sc.textFile("readPath")
val N = namesRDD.count / x

How to split the namesRDD into N files and write those to some "savepath/N/" - i.e. the first file is written to "savepath/1/", the second file to "savepath/2/", and so on.
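The arithmetic behind N and the per-split directories can be sketched in plain Scala; `numSplits` and `splitDir` are illustrative helpers, not part of the question's code (note that ceiling division avoids losing a partial last file):

```scala
// Illustrative helpers: how many split files are needed for a given
// lines-per-file target, and the output directory for split i.
def numSplits(totalLines: Long, linesPerFile: Long): Long =
  (totalLines + linesPerFile - 1) / linesPerFile  // ceiling division

def splitDir(savePath: String, i: Long): String =
  s"$savePath/$i/"
```

For 100 lines at 20 lines per file this yields 5 splits, written under "savepath/1/" through "savepath/5/".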

Upvotes: 0

Views: 4474

Answers (2)

vkt

Reputation: 1459

Split the file/DataFrame into N parts using repartition (use this when there is no column suitable for repartitionByRange and you want to split randomly):

df.repartition(N)
  .write.text(storePath)

Then read those partitions back (and do whatever you need with each partitioned DataFrame):

  for (i <- 0 until N) {
    val part = f"$i%05d"  // Spark zero-pads part-file indices to five digits
    val splitPath = s"$path/part-$part-*"
    // read the data under `splitPath`, e.g. spark.read.text(splitPath)
  }
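The path formatting in that loop can be sanity-checked in plain Scala; `partGlob` is an illustrative helper, assuming Spark's default `part-<5-digit index>-<uuid>` file naming:

```scala
// Build the glob that matches the i-th part file written by Spark.
// Assumes the default "part-00000-<uuid>..." naming convention.
def partGlob(basePath: String, i: Int): String =
  f"$basePath%s/part-$i%05d-*"
```

For example, `partGlob("savepath", 0)` produces `"savepath/part-00000-*"`, which a `spark.read` call can consume as a glob.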

Upvotes: 1

emran

Reputation: 922

Using repartitionByRange will let you split your data this way.

example:

df.repartitionByRange($"region").write.csv("data/regions")

With enough partitions, this creates one part file for every region that appears in your data: if you have 10 regions, you will get 10 different part- files.

If you want to control the output file names yourself, you will have to save the files with your own function inside foreachPartition.

df.repartitionByRange($"region")
  .foreachPartition { rows: Iterator[Row] =>
    // custom implementation: write this partition's rows yourself
  }
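If the goal is simply one output directory per region value, the DataFrameWriter's partitionBy is often the simpler route; a sketch, assuming `df` has a `region` column (the output path is illustrative):

```scala
// Writes one subdirectory per distinct region value, e.g.
// data/regions/region=EU/part-..., data/regions/region=US/part-...
df.write
  .partitionBy("region")
  .csv("data/regions")
```

Unlike foreachPartition, this keeps Spark's default file naming but gives you a predictable directory per value.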

Upvotes: 0
