Reputation: 944
I wrote a simple program that queries a huge database. To export my results, I wrote this function:
result.coalesce(1).write.options(Map("header" -> "true", "delimiter" -> ";")).csv("mycsv.csv")
I use the coalesce
method to get only one file as output. The problem is that the result file contains more than one million lines, so I couldn't open it in Excel...
So, I thought about using a method (or writing my own function with a for loop) that creates partitions based on the number of lines in my file. But I have no idea how to do this.
My idea is that if I have fewer than one million lines, I will have one partition. If I have more than one million => two partitions, more than two million => three partitions, and so on.
Is it possible to do something like this?
Upvotes: 2
Views: 11018
Reputation: 28392
You can change the number of partitions depending on the number of rows in the DataFrame.
For example:
// target maximum number of rows per output partition
val rowsPerPartition = 1000000
// count() returns a Long; integer division plus 1 gives at least one partition
val partitions = (1 + df.count() / rowsPerPartition).toInt
val df2 = df.repartition(numPartitions = partitions)
Then write the new DataFrame to a CSV file as before.
Note: you may need to use repartition
instead of coalesce
to make sure the number of rows in each partition is roughly equal; see Spark - repartition() vs coalesce().
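Putting it together, a minimal end-to-end sketch (hedged: `spark` is assumed to be an existing SparkSession, `result` the queried DataFrame, and `out_csv` a hypothetical output directory):

```scala
import org.apache.spark.sql.DataFrame

// Assumed to exist already: the DataFrame produced by the database query.
val result: DataFrame = spark.table("my_query_result") // hypothetical source

val rowsPerPartition = 1000000L

// count() returns a Long; integer division plus 1 guarantees at least
// one partition and caps each partition at roughly rowsPerPartition rows.
val partitions = (1 + result.count() / rowsPerPartition).toInt

result
  .repartition(partitions) // roughly equal-sized partitions, unlike coalesce
  .write
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("out_csv") // one part-*.csv file per partition in this directory
```

Each partition becomes its own part-*.csv file inside the output directory, so every file stays under the row limit and can be opened in Excel individually.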
Upvotes: 11