Rajiv

Reputation: 412

Cache and Query a Dataset In Parallel Using Spark

I have a requirement where I want to cache a dataset and then compute some metrics by firing N queries in parallel over that dataset. All of these queries compute similar metrics; only the filters change. I want to run the queries in parallel because response time is crucial, and the dataset I want to cache will always be less than a GB in size.

I know how to cache a dataset in Spark and then query it subsequently, but if I have to run queries in parallel over the same dataset, how can I achieve that? Introducing Alluxio is one way, but is there any other way to achieve this in the Spark world?

For example, with Java I can cache the data in memory and then achieve this with multithreading, but how can I do it in Spark?

Upvotes: 3

Views: 1485

Answers (1)

Raphael Roth

Reputation: 27373

It can be very simple to fire parallel queries in Spark's driver code using Scala's parallel collections. Here is a minimal example of how this could look:

import org.apache.spark.sql.DataFrame
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

// cache the (small) source dataset
val dfSrc = Seq(("Raphael",34)).toDF("name","age").cache()

// define your queries; instead of returning a DataFrame you could also write to a table etc.
val query1: (DataFrame) => DataFrame = (df:DataFrame) => df.select("name")
val query2: (DataFrame) => DataFrame = (df:DataFrame) => df.select("age")

// fire the queries in parallel
import scala.collection.parallel.ParSeq
ParSeq(query1,query2).foreach(query => query(dfSrc).show())
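Note that a parallel collection uses the global ForkJoinPool by default, so the parallelism follows the number of driver cores. If you want to limit how many queries are submitted to Spark at once, you can attach your own task support; a rough sketch (not part of the original answer, assuming Scala 2.12, where ForkJoinTaskSupport wraps a java.util.concurrent.ForkJoinPool):

// cap driver-side parallelism at 4 concurrently submitted queries
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val queries = ParSeq(query1, query2)
queries.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
queries.foreach(query => query(dfSrc).show())

Since all of these jobs share one SparkContext, they are scheduled FIFO by default; setting spark.scheduler.mode to FAIR lets concurrent jobs share the executors more evenly.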

EDIT:

To collect the query ID and the result in a map, you can do the following:

val resultMap = ParSeq(
  (1, query1),
  (2, query2)
).map { case (queryId, query) => (queryId, query(dfSrc)) }.toMap
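The individual results can then be looked up by query ID, for example:

resultMap(1).show()  // result of query1 (the "name" column)
resultMap(2).show()  // result of query2 (the "age" column)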

Upvotes: 3
