Knows Not Much

Reputation: 31546

Using Future inside of a Spark job

I want to perform 2 operations on a single RDD concurrently. I have written code like this

val conf = new SparkConf().setAppName("Foo")
val sc = new SparkContext(conf)
val sqlSc = new SQLContext(sc)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
val inputPath = path
val rdd = sc.textFile(inputPath).cache()

val f1 = Future {
  val schema1 = StructType(List(StructField("a", StringType, true), StructField("b", StringType, true), StructField("c", LongType, true)))
  val rdd1 = rdd.map(func1).filter(_.isDefined).flatMap(x => x)
  val df1 = sqlSc.createDataFrame(rdd1, schema1)
  df1.save("/foo/", "com.databricks.spark.avro")
  0
}

val f2 = Future {
  val schema2 = StructType(List(StructField("d", StringType, true), StructField("e", StringType, true)))
  val rdd2 = rdd.map(func2).filter(_.isDefined).flatMap(x => x)
  val df2 = sqlSc.createDataFrame(rdd2, schema2)
  df2.save("/bar/", "com.databricks.spark.avro")
  0
}

val result = for {
  r1 <- f1
  r2 <- f2
} yield(r1 + r2)

result onSuccess{
  case r => println("done")
}

Await.result(result, Duration.Inf)
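
(For completeness, the snippet assumes the usual Spark and Scala concurrency imports plus an implicit ExecutionContext on which the two Futures run; roughly something like this:)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
// implicit ExecutionContext on which the two Future bodies run
import scala.concurrent.ExecutionContext.Implicits.global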

When I run this code, I don't see the desired effect. The directory /bar/ has lots of temporary files and so on, but /foo/ has nothing, so it seems the two datasets are not being created in parallel.

Is it a good idea to use a Future inside the Spark driver? Am I doing it correctly? Should I do anything differently?

Upvotes: 3

Views: 2273

Answers (1)

Anish

Reputation: 71

To execute two or more Spark jobs (actions) in parallel, the SparkContext needs to be running in FAIR scheduler mode.

In the driver program, transformations only build the dependency graph; the actual execution happens only when an action is called. The driver then waits while the work runs on the worker nodes. In your case the second job does not start executing until the first one is over, because by default Spark schedules jobs within an application in FIFO order, so the first submitted job gets priority over the cluster resources.

You can set the configuration as follows to enable parallel execution:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

For details, see "Scheduling Within an Application" in the Spark documentation.
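
Putting this together with the code from the question, a minimal sketch might look like the one below. The pool names are hypothetical; sc.setLocalProperty("spark.scheduler.pool", ...) assigns the jobs submitted from that thread to a named fair-scheduler pool, and pools not declared in a fairscheduler.xml are created with default settings.

import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val conf = new SparkConf().setAppName("Foo").set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

val f1 = Future {
  // jobs submitted from this thread go into "pool1" (hypothetical pool name)
  sc.setLocalProperty("spark.scheduler.pool", "pool1")
  // ... build rdd1/df1 and save it to /foo/ as in the question ...
  0
}

val f2 = Future {
  sc.setLocalProperty("spark.scheduler.pool", "pool2")
  // ... build rdd2/df2 and save it to /bar/ as in the question ...
  0
}

// wait for both save jobs to finish before the driver exits
Await.result(f1.zip(f2), Duration.Inf)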

Upvotes: 1
