Does Spark do UnionAll in parallel?

Question

I have 10 DataFrames with the same schema that I'd like to combine into one DataFrame. Each DataFrame is constructed using sqlContext.sql("select ... from ...").cache(), which means that, technically, the DataFrames are not actually computed until it's time to use them.

So, if I run:

val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...

will Spark calculate all these DataFrames in parallel or one by one (due to the dot operator)?

And also, while we're here: is there a more elegant way to perform a unionAll on several DataFrames than the one I listed above?

Daniel Darabos · Accepted Answer

unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.
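To make the laziness concrete, here is a minimal sketch. It assumes a Spark version with SparkSession (2.x or later) running in local mode, whereas the question uses an existing sqlContext; the three small DataFrames built with spark.range are hypothetical stand-ins for the cached query results. On Spark 2.x, unionAll still works but is marked deprecated in favor of union.

```scala
import org.apache.spark.sql.SparkSession

object LazyUnionSketch {
  // Builds a chained unionAll, then runs one action and returns its row count.
  def run(): Long = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .appName("lazy-union-sketch")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical stand-ins for sqlContext.sql("select ... from ...").cache()
    val df1 = spark.range(0, 3).toDF("id").cache()
    val df2 = spark.range(3, 6).toDF("id").cache()
    val df3 = spark.range(6, 9).toDF("id").cache()

    // This line only builds a logical plan; no Spark job is started here.
    val dfFinal = df1.unionAll(df2).unionAll(df3)

    // An action such as count() is what actually triggers the computation.
    val rows = dfFinal.count()

    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit =
    println(s"rows = ${run()}")
}
```

Nothing is computed (or cached) until the count() call; the unionAll chain itself returns immediately.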

Spark is a distributed computation system. Each operation is itself made up of many tasks that are processed in parallel, so in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized either way.
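As for a tidier way to combine several DataFrames, one common idiom is to put them in a Seq and fold them together with reduce. The sketch below is an illustration under assumptions, not something stated in the answer above: it assumes SparkSession (Spark 2.x or later) in local mode, and unionMany and the sample DataFrames are hypothetical names. The result is still lazy, and a single action then runs the combined plan as a set of parallel tasks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionManySketch {
  // Hypothetical helper: folds a non-empty Seq of same-schema DataFrames
  // into one DataFrame. reduce throws on an empty Seq.
  def unionMany(dfs: Seq[DataFrame]): DataFrame =
    dfs.reduce((a, b) => a.unionAll(b))

  def run(): Long = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .appName("union-many-sketch")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical stand-ins for the ten cached DataFrames in the question.
    val dfs: Seq[DataFrame] = Seq(
      spark.range(0, 3).toDF("id"),
      spark.range(3, 6).toDF("id"),
      spark.range(6, 9).toDF("id")
    )

    // Still lazy: this only builds the plan.
    val combined = unionMany(dfs)

    // The partitions of the result are what Spark schedules as parallel tasks.
    println(s"partitions = ${combined.rdd.getNumPartitions}")

    // One action computes the whole union.
    val rows = combined.count()

    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit =
    println(s"rows = ${run()}")
}
```

With ten DataFrames the call would simply be unionMany(Seq(df1, df2, ..., df10)), replacing the long chain of dots.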
