Reputation: 85
Say I have a Spark job that looks like the following:
def loadTable1() {
  val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile("s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()
How do I parallelize this Spark job so that both tables are created at the same time?
Upvotes: 2
Views: 7555
Reputation: 17141
Use Futures!
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def loadAllTables() {
  // Kick off both loads concurrently, then block until both have finished.
  val loads = Seq(Future(loadTable1()), Future(loadTable2()))
  Await.result(Future.sequence(loads), Duration.Inf)
}
Upvotes: 0
Reputation: 27455
You don't need to parallelize it. The RDD/DF creation operations don't do anything. These data structures are lazy, so any actual calculation will only happen when you start using them. And when a Spark calculation does happen, it will be automatically parallelized (partition-by-partition). Spark will distribute the work across the executors. So you would not generally gain anything by introducing further parallelism.
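To make the laziness concrete, here is a minimal sketch reusing table1 from the question (the COUNT query is only an illustrative first action):

def loadTable1() {
  // Builds the DataFrame lazily; the cache is not populated yet
  // (jsonFile may scan the data once to infer the schema, but that's all).
  val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

loadTable1()
// The actual read and caching happen here, at the first action,
// parallelized partition-by-partition across the executors.
sqlContext.sql("SELECT COUNT(*) FROM table1").collect()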
Upvotes: 4
Reputation: 7442
You can do this with standard Scala concurrency mechanisms. Personally, I'd build a list of (path, table name) pairs and map over it in parallel, as sketched below. You could also look at Futures or plain threads.
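A minimal sketch of that idea using Scala's parallel collections (the paths and table names are the ones from the question):

val tables = Seq(
  "s3://textfiledirectory/" -> "table1",
  "s3://testfiledirectory2/" -> "table2"
)

// .par converts the Seq into a parallel collection, so the loads
// run concurrently rather than one after the other.
tables.par.foreach { case (path, name) =>
  sqlContext.jsonFile(path).cache().registerTempTable(name)
}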
Upvotes: -1