user2037661

Reputation: 51

Spark: efficient way to get a single value from each partition? Accumulator?

In my Spark application, each partition generates a single object, which is small and contains a summary of the data in that partition. Right now I collect these summaries by tagging them and mixing them into the same RDD that holds the main data:

val tagged: RDD[(String, Any)] = rdd.mapPartitions(iter => /* pass rows through as ("data", row) and append one ("summary", obj) per partition */)
val summaries = tagged.filter(_._1 == "summary").map(_._2).collect()
val data = tagged.filter(_._1 == "data").map(_._2)  // used for further RDD processing

The Summary objects are used immediately, while data goes on to further RDD processing. The problem is that this code evaluates the underlying RDD twice (once for the collect here, once again later), which is expensive in my application. cache or persist would help, but I cannot use them in my app.

Is there a good way to collect one object from each partition? Would an accumulator work?

Upvotes: 1

Views: 373

Why can't you use cache or persist? – Dennis Jaheruddin

@DennisJaheruddin Because the dataset could be huge, a hundred times bigger than my memory. – user2037661

Answers (2)

ebonnal

Reputation: 1167

You can use the storage level MEMORY_AND_DISK if you want to cache a DataFrame that does not fit in memory. This storage level is currently the default when calling cache or persist on a DataFrame (plain RDDs default to MEMORY_ONLY).
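A minimal sketch of that, assuming the heavy dataset is an RDD named rdd:

import org.apache.spark.storage.StorageLevel

// Partitions that fit in memory stay there; the rest spill to local disk,
// so the second pass rereads them instead of recomputing the whole lineage.
rdd.persist(StorageLevel.MEMORY_AND_DISK)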

Upvotes: 1

ebonnal

Reputation: 1167

Make the function passed to mapPartitions return an iterator containing only the summary object. Then you can collect the summaries directly, with no need for any additional filtering.
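A minimal sketch of that idea; summarize is a hypothetical function that folds a partition's rows into one small Summary object:

// Hypothetical: reduce one partition's rows to a single Summary.
def summarize[T](rows: Iterator[T]): Summary = ???

// One Summary per partition, collected to the driver in a single pass.
val summaries: Array[Summary] =
  rdd.mapPartitions(iter => Iterator(summarize(iter))).collect()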

Upvotes: 1
