Bruckwald

Reputation: 797

Batch lookup data for Spark streaming

I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?

Upvotes: 2

Views: 1186

Answers (1)

zero323

Reputation: 330443

One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

val mainStream: DStream[T] = ???

Next, you can create another stream which reads the lookup data:

val lookupStream: DStream[(K, V)] = ???
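
Since the question says the lookup data lands on HDFS once a day, one way to build this stream is to watch the drop directory with textFileStream and parse each line into a pair. This is only a sketch: the directory path, the "key,value" line format, and the ssc StreamingContext are assumptions used to make it concrete.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

val ssc: StreamingContext = ???

// Picks up files newly created in the directory the daily batch job writes to
// (hypothetical path); each line is assumed to be "key,value"
val lookupStream: DStream[(String, String)] = ssc
  .textFileStream("hdfs:///data/lookup/")
  .map { line =>
    val Array(k, v) = line.split(",", 2)
    (k, v)
  }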

and a simple function which can be used to update the state:

def update(
  current: Seq[V],  // Values observed for a given key in the current batch
  prev: Option[V]   // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption    // If the current batch is not empty, take its first element
    .orElse(prev)  // If it is empty (None), keep the previous state
}
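
With these semantics the latest non-empty batch wins, and between daily loads the previous value is simply carried over. For example, if V were String:

update(Seq("new-value"), Some("old-value"))  // Some("new-value"): a fresh load replaces the old value
update(Seq.empty, Some("old-value"))         // Some("old-value"): no new data in this batch, keep the state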

These two pieces can be used to create the state:

val state = lookupStream.updateStateByKey(update)
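
Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext; the directory below is just an example path:

ssc.checkpoint("hdfs:///checkpoints/lookup-state")  // required for stateful transformations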

All that's left is to key mainStream and join the data:

def toPair(t: T): (K, T) = ???

mainStream.map(toPair).leftOuterJoin(state)
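
Putting it together with a concrete record type, the wiring might look as follows. Event, its id field, and the String-valued state are assumptions used only to illustrate the keying and the shape of the join output:

case class Event(id: String, payload: String)      // hypothetical main-stream record

val mainStream: DStream[Event] = ???
def toPair(e: Event): (String, Event) = (e.id, e)  // key each event by the lookup key

// Each element carries the event plus the latest lookup value for its key, if any
val enriched: DStream[(String, (Event, Option[String]))] =
  mainStream.map(toPair).leftOuterJoin(state)

enriched.foreachRDD { rdd =>
  rdd.take(5).foreach(println)  // e.g. inspect a few enriched records
}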

While this is probably less than optimal from a performance point of view, it leverages the architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.

Upvotes: 2
