Reputation: 1174
Is it possible to execute foreach on a dataframe so that I can return a dataset? I have a requirement that can only be satisfied by processing the records in order, so I am using foreach over the dataframe, but I need to create a new dataset from the result so I can write it into a parquet output file. This pseudo-code is what I would like to accomplish:
dataframe.foreachPartition( it => {
  // process records . . .
  // write the results from this partition into a file for aggregation later
  sparkSession.write . . .
})
// read a dataframe containing all the data sets written by the tasks
// read a dataframe containing all the data sets written by the tasks
sparkSession.read . . .
I know that is pretty sparse, but it summarizes what I need to do. The call to sparkSession.write is not allowed inside the foreach, so I am wondering if there is another way.
Upvotes: 2
Views: 1990
Reputation: 7316
Actually, you don't have access to DataFrames or Datasets within foreachPartition. That's because Datasets and DataFrames, like other Spark entities such as the SparkSession, are available only from the driver code.
One alternative, though, would be to generate the parquet files directly using the Hadoop API within foreachPartition, since your partition's data is accessible there:
dfB.repartition(2).foreachPartition( iter => {
  iter.foreach(i => println(i))
})
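As a minimal sketch of that idea, each task can write its own part file through the Hadoop FileSystem API and the driver can read the directory back afterwards. The output path `/tmp/ordered-output` is a made-up example, and I'm writing plain CSV lines rather than actual Parquet to keep the sketch short:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.sql.Row

// Sketch: each partition writes its own file via the Hadoop FS API.
// This runs on the executors, so no SparkSession is needed inside the closure.
dfB.foreachPartition { iter: Iterator[Row] =>
  val fs   = FileSystem.get(new Configuration())
  // Name each file after its partition id so the tasks never collide.
  val path = new Path(s"/tmp/ordered-output/part-${TaskContext.getPartitionId()}")
  val out  = fs.create(path)
  try iter.foreach(row => out.writeBytes(row.mkString(",") + "\n")) // process in order, then emit
  finally out.close()
}

// Back on the driver, read everything the tasks wrote:
val result = sparkSession.read.csv("/tmp/ordered-output")
```

Note that this trades Spark's managed output commit protocol for manual file handling, so you are responsible for cleaning up partial files if a task fails and retries.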
Here is another thread that describes this issue and its solution in depth.
Good luck
Upvotes: 1