Reputation: 51
In my Spark application, each partition generates a single object, which is small and contains a summary of the data in that partition. Now I collect these objects by putting them into the same RDD that also holds the main data:
val df: RDD[(String, Any)] = rdd.mapPartitions(iter => /* pass through ("data", record) pairs and emit one ("summary", obj) pair */)
val summaries = df.filter(_._1 == "summary").map(_._2).collect()
val data = df.filter(_._1 == "data").map(_._2) // used for further RDD processing
The summary objects are used immediately, and data will be used for further RDD processing.
The problem is that this code causes df to be evaluated twice (once here, and once again later), which is expensive in my application. cache or persist would help, but I cannot use them in my app.
Is there a good way to collect the object from each partition? How about an accumulator?
Upvotes: 1
Views: 373
Reputation: 1167
Why can't you use cache or persist? – Dennis Jaheruddin
@DennisJaheruddin Because df could be huge, a hundred times bigger than my memory. – user2037661
You can use the storage level MEMORY_AND_DISK if you want to cache a DataFrame that does not fit in memory. This storage level is currently the default when calling cache or persist on a DataFrame.
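For example (a minimal sketch, assuming the tagged df from the question is an RDD):

import org.apache.spark.storage.StorageLevel

// Partitions that do not fit in memory spill to disk, so the second
// evaluation reads the cache instead of recomputing df.
df.persist(StorageLevel.MEMORY_AND_DISK)
val summaries = df.filter(_._1 == "summary").map(_._2).collect() // first action materializes the cache
val data = df.filter(_._1 == "data").map(_._2) // reuses the cached partitions
// call df.unpersist() once the downstream processing has run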
Upvotes: 1
Reputation: 1167
Make the function passed to mapPartitions return an iterator containing only the summary object. Then you can collect directly, with no need for any additional filtering.
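A minimal sketch of that idea, with hypothetical Record and Summary types standing in for the question's data:

import org.apache.spark.rdd.RDD

case class Record(value: Double)
case class Summary(count: Long, sum: Double)

// Fold one partition's records into a single Summary object.
def summarize(iter: Iterator[Record]): Summary =
  iter.foldLeft(Summary(0L, 0.0))((s, r) => Summary(s.count + 1, s.sum + r.value))

// One Summary per partition, collected directly: no tagging, no filtering.
def collectSummaries(records: RDD[Record]): Array[Summary] =
  records.mapPartitions(iter => Iterator(summarize(iter))).collect()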
Upvotes: 1