Reputation: 61
For example I am doing a lot of calculations and I am finally down to 3 dataframes.
for example:
val mainQ = spark.sql("select * from employee")
mainQ.createOrReplaceTempView("mainQ")
val mainQ1 = spark.sql("select state,count(1) from mainQ group by state")
val mainQ2 = spark.sql("select dept_id,sum(salary) from mainQ group by dept_id")
val mainQ3 = spark.sql("select dept_id,state , sum(salary) from mainQ group by dept_id,state")
//Basically I want to write below writes in parallel. I could put into
//Different files. But that is not what I am looking at. Once all computation is done. I want to write the data in parallel.
mainQ1.write.mode("overwrite").save("/user/h/mainQ1.txt")
mainQ2.write.mode("overwrite").save("/user/h/mainQ2.txt")
mainQ3.write.mode("overwrite").save("/user/h/mainQ3.txt")
Upvotes: 3
Views: 715
Reputation: 27383
Normally there is no benefit using multi-threading in the driver code, but sometimes it can increase performance. I had some situations where launching parallel spark jobs increased performance drastically, namely when the individual jobs do not utilize the cluster resources well (e.g. due to data skew, too few partitions etc). In your case you can do:
ParSeq(
(mainQ1,"/user/h/mainQ1.txt"),
(mainQ2,"/user/h/mainQ2.txt"),
(mainQ3,"/user/h/mainQ3.txt")
).foreach{case (df,filename) =>
df.write.mode("overwrite").save(filename)
}
Upvotes: 3