Reputation: 2893
I have 10 DataFrames with the same schema which I'd like to combine into one DataFrame. Each DataFrame is constructed using sqlContext.sql("select ... from ...").cache, which means that technically, the DataFrames are not really calculated until it's time to use them.
So, if I run:
val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...
will Spark calculate all these DataFrames in parallel or one by one (due to the dot operator)?
And also, while we're here - is there a more elegant way to perform a unionAll on several DataFrames than the one I listed above?
Upvotes: 1
Views: 1426
Reputation: 27455
unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.
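A minimal sketch of what that means (assuming a spark-shell-style sqlContext and two placeholder tables t1 and t2): the union only builds a logical plan, and nothing is computed until an action such as count is called.

// No Spark job runs for any of these lines - they only describe the plan.
val df1 = sqlContext.sql("select * from t1").cache()
val df2 = sqlContext.sql("select * from t2").cache()
val combined = df1.unionAll(df2)

// The action is what triggers the actual computation (and fills the caches).
combined.count()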
In general Spark is a distributed computation system. Each operation itself is made up of a bunch of tasks that are processed in parallel. So in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized anyway.
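As for the second part of the question, one common way to combine several DataFrames without chaining the calls by hand is to reduce over a sequence. This is just a sketch; df1 through df10 stand for your cached DataFrames.

import org.apache.spark.sql.DataFrame

// Collect the DataFrames and fold them together; the result is still lazy
// and equivalent to chaining unionAll calls one by one.
val dfs: Seq[DataFrame] = Seq(df1, df2, df3 /* ..., df10 */)
val df_final = dfs.reduce((a, b) => a.unionAll(b))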
Upvotes: 3