Reputation: 646
I want to split up a dataframe of 2.7 million rows into smaller dataframes of 100,000 rows each, so I end up with roughly 27 dataframes, which I also want to store as CSV files.
I have already looked at partitionBy and groupBy, but I don't need to worry about any conditions, except that the rows have to be ordered by date. I am trying to write my own code to make this work, but if you know of some Scala (Spark) functions I could use, that would be great!
Thank you all for the suggestions!
Upvotes: 2
Views: 1625
Reputation: 10406
You could use zipWithIndex from the RDD API (no equivalent in Spark SQL, unfortunately), which maps each row to an index ranging between 0 and rdd.count - 1.
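As a quick illustration of what zipWithIndex does on its own (the three-element RDD below is made up purely for this sketch):

// a tiny RDD, just to show the indexing
val sample = spark.sparkContext.parallelize(Seq("a", "b", "c"))
// each element is paired with its position, starting at 0
sample.zipWithIndex.collect()
// Array[(String, Long)] = Array((a,0), (b,1), (c,2))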
So, if you have a dataframe that I assume is already sorted accordingly (by date, in your case), you would need to go back and forth between the two APIs as follows:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._ // needed for the 'id column syntax outside the shell

// creating mock data
val df = spark.range(100).withColumn("test", 'id % 10)

// zipping the data: each row gets a global index, and integer division
// by partitionSize turns that index into a partition number
val partitionSize = 5 // I use 5 here, but you can use 100000 in your case
val zipped_rdd = df.rdd
  .zipWithIndex
  .map { case (row, id) =>
    Row.fromSeq(row.toSeq :+ id / partitionSize)
  }

// back to a dataframe, with the new "partition" column appended to the schema
val newField = StructField("partition", LongType, false)
val zipped_df = spark
  .createDataFrame(zipped_rdd, df.schema.add(newField))
Let's have a look at the data: we have a new column called partition, and it corresponds to the way you want to split your data.
zipped_df.show(15) // 5 rows per partition
+---+----+---------+
| id|test|partition|
+---+----+---------+
| 0| 0| 0|
| 1| 1| 0|
| 2| 2| 0|
| 3| 3| 0|
| 4| 4| 0|
| 5| 5| 1|
| 6| 6| 1|
| 7| 7| 1|
| 8| 8| 1|
| 9| 9| 1|
| 10| 0| 2|
| 11| 1| 2|
| 12| 2| 2|
| 13| 3| 2|
| 14| 4| 2|
+---+----+---------+
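If you want to sanity check the split before writing (this check is mine, not part of the original recipe), counting rows per partition should show partitionSize rows in every group, here 5 rows in each of the 20 groups:

// verify that each partition holds the expected number of rows
zipped_df.groupBy("partition").count().orderBy("partition").show()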
// using partitionBy to write the data: one subfolder of CSV files per
// value of the partition column
zipped_df.write
  .partitionBy("partition")
  .csv(".../testPart.csv")
Upvotes: 2