Reputation: 1459
How to split a large file/RDD/DF into small files and save to different paths.
Example: given a text file that contains a single column of usernames, I want to split it into N files and write each of those N files to a different directory.
val x=20
val namesRDD = sc.textFile("readPath")
val N = namesRDD.count/x
How can I split namesRDD into N files and write each one to its own directory under "savepath/", i.e. the first file is written to "savepath/1/", the second file to "savepath/2/", and so on?
Upvotes: 0
Views: 4474
Reputation: 1459
Split the file/DataFrame into N parts using repartition (use this when there is no column to pass to repartitionByRange and a random split is acceptable):
df.repartition(N)
.write.text(storePath)
Then read those partitions back (and do whatever is needed on each partitioned DataFrame):
for (i <- 0 until N) {
  // Spark names its output files part-00000-<id>, part-00001-<id>, ...
  val partNumber = f"$i%05d"
  val splitPath = s"$storePath/part-$partNumber-*"
  // read data from `splitPath`
}
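If the goal is literally one directory per chunk ("savepath/1/", "savepath/2/", ...), a different approach may fit better than reading part files back. The following is only a minimal sketch, not a tested production solution: it assumes a SparkSession is available, that `x` in the question means lines per output file, and that running one small write job per chunk is acceptable. The names `SplitIntoDirs`, `chunkOf`, `numChunks` and `run` are hypothetical helpers introduced here for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SplitIntoDirs {
  // 1-based chunk number for a 0-based line index.
  def chunkOf(index: Long, linesPerFile: Long): Long = index / linesPerFile + 1

  // Number of chunks needed for `total` lines (ceiling division).
  def numChunks(total: Long, linesPerFile: Long): Long =
    (total + linesPerFile - 1) / linesPerFile

  // Assumption: readPath is a text file with one username per line.
  def run(spark: SparkSession, readPath: String, savePath: String, linesPerFile: Long): Unit = {
    import spark.implicits._

    val names = spark.sparkContext.textFile(readPath)

    // zipWithIndex assigns each line a 0-based index; derive the chunk from it.
    val chunked = names.zipWithIndex()
      .map { case (name, idx) => (chunkOf(idx, linesPerFile), name) }
      .toDF("chunk", "name")
      .cache()

    val chunks = numChunks(chunked.count(), linesPerFile)

    // One write job per chunk: savePath/1, savePath/2, ...
    for (i <- 1L to chunks) {
      chunked.filter(col("chunk") === i)
        .select("name")
        .coalesce(1) // a single part file inside each directory
        .write.text(s"$savePath/$i")
    }
    chunked.unpersist()
  }
}
```

Note that this uses ceiling division, so a final, shorter chunk is kept, whereas `namesRDD.count/x` in the question truncates. With many chunks, the loop launches many small jobs and will probably be slow.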
Upvotes: 1
Reputation: 922
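Using repartitionByRange will let you split your data this way.
Example:
df.repartitionByRange($"region").write.csv("data/regions")
This will typically create one part file for every region that appears in your data, provided the number of shuffle partitions (spark.sql.shuffle.partitions, 200 by default) is at least the number of distinct regions. The range boundaries come from sampling, so this is not strictly guaranteed. If you have 10 regions, you would then have 10 different part files.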
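If you need exactly one output location per region rather than a likely one, `partitionBy` on the writer is an alternative to consider. This is a sketch under assumptions: `RegionWriter`, `regionDir` and `writeByRegion` are hypothetical names, and `df` is assumed to have a `region` column with simple values (Spark escapes special characters in directory names).

```scala
import org.apache.spark.sql.DataFrame

object RegionWriter {
  // Directory layout used by partitionBy for one value (Hive-style column=value).
  def regionDir(outPath: String, region: String): String = s"$outPath/region=$region"

  def writeByRegion(df: DataFrame, outPath: String): Unit =
    df.repartition(df.col("region")) // all rows of a region end up in one task
      .write
      .partitionBy("region")         // one sub-directory per distinct region
      .csv(outPath)
}
```

With this layout the `region` value is encoded in the directory name (for example `data/regions/region=EU/`) and is not repeated inside the CSV files.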
If you want to specify your own file names, you will have to apply your own function to save the files with foreachPartition.
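// requires: import org.apache.spark.sql.Row
df.repartitionByRange($"region")
  .foreachPartition((rows: Iterator[Row]) => {
    // custom implementation: `rows` iterates over all rows of one partition
  })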
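As an illustration of what such a custom implementation could look like, here is a minimal sketch. It makes several assumptions: it writes to the local filesystem of whichever executor runs the task (so it is only suitable for local mode or a shared mount), it uses the partition id as the directory name, and it joins values with commas without any quoting or escaping. `PartitionWriter`, `toCsvLine`, `writePartitions` and the file name `data.csv` are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.TaskContext
import org.apache.spark.sql.{DataFrame, Row}

object PartitionWriter {
  // Naive CSV formatting: no quoting or escaping of values.
  def toCsvLine(values: Seq[Any]): String = values.mkString(",")

  def writePartitions(df: DataFrame, baseDir: String): Unit = {
    df.repartitionByRange(df.col("region"))
      .foreachPartition { (rows: Iterator[Row]) =>
        // Skip empty partitions so no empty directories are created.
        if (rows.hasNext) {
          val dir = Paths.get(baseDir, TaskContext.getPartitionId().toString)
          Files.createDirectories(dir)
          val writer = Files.newBufferedWriter(dir.resolve("data.csv"), StandardCharsets.UTF_8)
          try {
            rows.foreach { row =>
              writer.write(toCsvLine(row.toSeq))
              writer.newLine()
            }
          } finally {
            writer.close()
          }
        }
      }
  }
}
```

For HDFS or object storage you would need to replace the `java.nio` calls with the Hadoop FileSystem API, which is not shown here.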
Upvotes: 0