Rajiv

Reputation: 412

Cache and Query a Dataset In Parallel Using Spark

I have a requirement where I want to cache a dataset and then compute some metrics by firing N queries in parallel over that dataset. All of these queries compute similar metrics; only the filters change. I want to run the queries in parallel because response time is crucial, and the dataset I want to cache will always be less than a GB in size.

I know how to cache a dataset in Spark and then query it subsequently, but if I have to run queries in parallel over the same dataset, how can I achieve that? Introducing Alluxio is one way, but is there any other way to achieve this in the Spark world?

For example, with Java I can cache the data in memory and then achieve this with multithreading, but how can I do it in Spark?

Upvotes: 3

Views: 1485

Answers (1)

Raphael Roth

Reputation: 27373

It can be very simple to fire parallel queries in Spark's driver code using Scala's parallel collections. Here is a minimal example of how this could look:

import org.apache.spark.sql.DataFrame
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

// cache the (small) source dataset
val dfSrc = Seq(("Raphael",34)).toDF("name","age").cache()

// define your queries; instead of returning a DataFrame you could also write to a table etc.
val query1: (DataFrame) => DataFrame = (df:DataFrame) => df.select("name")
val query2: (DataFrame) => DataFrame = (df:DataFrame) => df.select("age")

// fire the queries in parallel
import scala.collection.parallel.ParSeq
ParSeq(query1,query2).foreach(query => query(dfSrc).show())
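Note that a parallel collection uses the global ForkJoinPool by default, so the parallelism follows the number of driver cores. If you want to limit how many queries are submitted to Spark at once, you can attach your own task support; a rough sketch (not part of the original answer, assuming Scala 2.12, where ForkJoinTaskSupport wraps a java.util.concurrent.ForkJoinPool):

// cap driver-side parallelism at 4 concurrently submitted queries
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val queries = ParSeq(query1, query2)
queries.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
queries.foreach(query => query(dfSrc).show())

Since all of these jobs share one SparkContext, they are scheduled FIFO by default; setting spark.scheduler.mode to FAIR lets concurrent jobs share the executors more evenly.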

EDIT:

To collect the query ID and the result in a map, you can do the following:

val resultMap = ParSeq(
  (1, query1),
  (2, query2)
).map { case (queryId, query) => (queryId, query(dfSrc)) }.toMap
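The individual results can then be looked up by query ID, for example:

resultMap(1).show()  // result of query1 (the "name" column)
resultMap(2).show()  // result of query2 (the "age" column)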

Upvotes: 3
