user3539924

Reputation: 61

Writing data frames in parallel in Scala

I am doing a lot of calculations and am finally down to 3 dataframes.

For example:

val mainQ = spark.sql("select * from employee")
mainQ.createOrReplaceTempView("mainQ")
val mainQ1 = spark.sql("select state, count(1) from mainQ group by state")
val mainQ2 = spark.sql("select dept_id, sum(salary) from mainQ group by dept_id")
val mainQ3 = spark.sql("select dept_id, state, sum(salary) from mainQ group by dept_id, state")
// Basically I want to perform the writes below in parallel. I could put them
// into different files, but that is not what I am looking for. Once all
// computation is done, I want to write the data in parallel.
mainQ1.write.mode("overwrite").save("/user/h/mainQ1.txt")
mainQ2.write.mode("overwrite").save("/user/h/mainQ2.txt")
mainQ3.write.mode("overwrite").save("/user/h/mainQ3.txt")

Upvotes: 3

Views: 715

Answers (1)

Raphael Roth

Reputation: 27383

Normally there is no benefit to using multi-threading in the driver code, but it can sometimes improve performance. I have had situations where launching Spark jobs in parallel increased performance drastically, namely when the individual jobs do not utilize the cluster resources well (e.g. due to data skew, too few partitions, etc.). In your case you can do:

import scala.collection.parallel.immutable.ParSeq

ParSeq(
  (mainQ1, "/user/h/mainQ1.txt"),
  (mainQ2, "/user/h/mainQ2.txt"),
  (mainQ3, "/user/h/mainQ3.txt")
).foreach { case (df, filename) =>
  // Each save launches its own Spark job; the parallel collection
  // runs the three foreach bodies on different driver threads.
  df.write.mode("overwrite").save(filename)
}

Note that in Scala 2.13+ the parallel collections were moved out of the standard library, so this requires the separate `scala-parallel-collections` module on the classpath.
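An equivalent sketch using `scala.concurrent.Future` instead of parallel collections, which works on all Scala versions without extra dependencies. This is an assumption-laden illustration, not from the original answer: it assumes the `mainQ1`/`mainQ2`/`mainQ3` dataframes from the question are in scope and that a Spark session is running.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Kick off each write on its own driver thread; Spark schedules the
// resulting jobs concurrently on the cluster.
val writes = Seq(
  (mainQ1, "/user/h/mainQ1.txt"),
  (mainQ2, "/user/h/mainQ2.txt"),
  (mainQ3, "/user/h/mainQ3.txt")
).map { case (df, path) =>
  Future { df.write.mode("overwrite").save(path) }
}

// Block until all three writes finish; propagates the first failure, if any.
Await.result(Future.sequence(writes), Duration.Inf)
```

Unlike the fire-and-forget style, `Future.sequence` plus `Await.result` makes the driver wait for all writes before continuing, so downstream code does not race ahead of the output.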

Upvotes: 3
