Ali

Reputation: 267049

Spark scheduling / architecture confusion

I'm attempting to set up a Spark cluster using Spark's built-in standalone cluster manager (not YARN or Mesos). I'm trying to understand how things need to be architected.

Here's my understanding:

My questions are:

Upvotes: 1

Views: 134

Answers (1)

T. Gawęda

Reputation: 16076

  1. You can run your application from a non-worker node; this is called client mode, and the driver runs on the machine where you submit it. If the application's driver runs inside one of the worker nodes, it's called cluster mode. Both are possible.

  2. Please take a look at Spark Streaming; it seems that it will fit your requirements. You can specify that data is collected every hour and that the computation then starts. You can also create a cron task that executes spark-submit.

  3. Yes, the recommended way is through the spark-submit script. You can, however, run this script from cron jobs, from Marathon, or from Oozie. It depends very much on what you want to do.
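
To make points 1 and 3 a bit more concrete, here is a minimal Scala sketch that assembles a spark-submit command line for either deploy mode. The flags --master, --deploy-mode and --class are real spark-submit options; the helper name submitCommand, the master URL, the class name and the jar path are made up for illustration, so treat this as a sketch rather than a recipe.

```scala
// Hypothetical helper (not part of Spark): builds the argument list for spark-submit.
// deployMode is "client" (the driver runs where spark-submit is invoked)
// or "cluster" (the driver is launched on one of the workers).
def submitCommand(master: String, deployMode: String, mainClass: String, jar: String): Seq[String] = {
  require(Set("client", "cluster")(deployMode), s"unknown deploy mode: $deployMode")
  Seq("spark-submit", "--master", master, "--deploy-mode", deployMode, "--class", mainClass, jar)
}

// Assumed values: replace them with your own master URL, main class and application jar.
val clientCmd  = submitCommand("spark://master-host:7077", "client",  "com.example.MyJob", "/path/to/my-job.jar")
val clusterCmd = submitCommand("spark://master-host:7077", "cluster", "com.example.MyJob", "/path/to/my-job.jar")

println(clientCmd.mkString(" "))
println(clusterCmd.mkString(" "))
```

Either printed command line can be put into a cron entry as-is. If you would rather launch it from Scala code, importing scala.sys.process._ lets you execute the sequence with clusterCmd.! and get the exit code back.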

If you want more info, please write more about your use case and I'll try to update my answer with more precise information.

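Update after comment: I recommend looking at Spark Streaming. It has a connector to Kafka, and you can write aggregations or custom processing, via foreachRDD, on the data received from specific topics. A sketch of the approach, assuming the Kafka 0.8 direct-stream integration with String keys and values (the application name and broker address are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("KafkaTopics") // placeholder name; the master comes from spark-submit
val ssc = new StreamingContext(sparkConf, Seconds(2))     // 2-second batch interval
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker address
// The direct stream yields (key, value) pairs, so _._1 is the message key, not the topic.
// To separate topics, create one stream per topic instead of filtering on _._1.
def streamFor(topic: String) =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(topic))
val topicFirst = streamFor("topic1")
val topic2 = streamFor("topic2")

topicFirst.foreachRDD { rdd =>
  // do some processing with the data collected during this batch interval
}
ssc.start()            // nothing runs until the context is started
ssc.awaitTermination() // keep the long-running job alive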

About cron: you can invoke spark-submit with nohup. However, if the jobs must run at short intervals, it's better to have one long-running job than many small ones. Spark Streaming looks like a good fit for you, because then you'll have a single long-running job. The mandatory Word Count example is here :)

Upvotes: 1
