Brandon

Reputation: 85

How do you parallelize RDD / DataFrame creation in Spark?

Say I have a spark job that looks like following:

def loadTable1() {
  val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile("s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}


def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()

How do I parallelize this Spark job so that both tables are created at the same time?

Upvotes: 2

Views: 7555

Answers (3)

BAR

Reputation: 17141

Use Futures!

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def loadAllTables() {
  val f1 = Future { loadTable1() }
  val f2 = Future { loadTable2() }
  // Block until both tables are registered before returning
  Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)
}
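
Awaiting the combined future matters: without it, loadAllTables() returns immediately and the temp tables may not be registered yet when something queries them. The pool size of 10 is arbitrary here; two threads are enough for two tables.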

Upvotes: 0

Daniel Darabos

Reputation: 27455

You don't need to parallelize it. The RDD/DataFrame creation operations don't do any real work by themselves: these data structures are lazy, so the actual computation only happens when you start using them. And when a Spark calculation does run, it is automatically parallelized partition by partition, with Spark distributing the work across the executors. So you would generally gain nothing by introducing further parallelism.
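
For example (a sketch reusing the question's own code; the count query is just a stand-in for whatever you actually run against the table):

val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
table1.cache().registerTempTable("table1")  // lazy: nothing is cached yet

// The distributed work happens here, and Spark runs it across the
// executors partition by partition on its own:
sqlContext.sql("SELECT COUNT(*) FROM table1").show()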

Upvotes: 4

Holden

Reputation: 7442

You can do this with standard Scala threading mechanisms. Personally, I'd build a list of (path, table name) pairs and then parallel-map over it, as sketched below. You could also look at futures or plain threads.
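
A minimal sketch of that approach with Scala's parallel collections, assuming the same sqlContext and the paths and table names from the question:

val tables = Seq(
  "s3://textfiledirectory/"  -> "table1",
  "s3://testfiledirectory2/" -> "table2"
)

// .par converts the Seq to a parallel collection, so each
// (path, name) pair is loaded and registered on its own thread.
tables.par.foreach { case (path, name) =>
  sqlContext.jsonFile(path).cache().registerTempTable(name)
}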

Upvotes: -1
