Daniel N.

Reputation: 93

Spark-streaming for task parallelization

I am designing a system with the following flow:

  1. Download feed files (line based) over the network
  2. Parse the elements into objects
  3. Filter invalid / unnecessary objects
  4. Execute blocking IO (HTTP Request) on part of the elements
  5. Save to DB

Flow diagram
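
In code terms, the flow I have in mind looks roughly like this (plain pseudocode, not tied to any framework; every name is a placeholder):

// rough pseudocode of the intended pipeline; every name is a placeholder
val rawLines = downloadFeed(feedUrl)      // 1. download feed files over the network
val elements = rawLines.map(parseLine)    // 2. parse the lines into objects
val valid    = elements.filter(isValid)   // 3. filter invalid / unnecessary objects
val enriched = valid.map { e =>           // 4. blocking HTTP request on part of the elements
  if (needsHttpLookup(e)) callHttpService(e) else e
}
enriched.foreach(saveToDb)                // 5. save to DB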

I have been considering implementing the system using Spark-streaming, mainly for task parallelization, resource management, fault tolerance, etc.

But I am not sure this is the right use-case for Spark-streaming, since I would not be using it purely for metrics and data processing. I am also not sure how Spark-streaming handles blocking IO tasks.

Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?

Upvotes: 1

Views: 1572

Answers (2)

maasg

Reputation: 37435

Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching. We can certainly implement such a use case on Spark Streaming.

To fan out the I/O operations, we need to ensure the right degree of parallelism at two levels:

  • First, distribute the data evenly across partitions: the initial partitioning of the data will depend on the streaming source used. For this use case, it looks like a custom receiver could be the way to go (see the sketch after this list). After the batch is received, we probably need to use dstream.repartition(n) to move to a larger number of partitions, roughly 2-3x the number of executors allocated for the job.

  • Second, Spark uses 1 core (configurable) for each task it executes, and tasks are executed per partition. This assumes that our task is CPU-intensive and requires a full core. To optimize execution for blocking I/O, we instead want to multiplex that core across many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
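
As a rough illustration of the custom-receiver idea, a sketch could look like the following (this is only a sketch: fetchFeedLines and the feed URL handling are placeholders, not a working download client):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// sketch of a custom receiver that downloads feed files and pushes
// their lines into the stream; fetchFeedLines is a placeholder for
// the actual network client
class FeedReceiver(feedUrl: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // download on a dedicated thread so that onStart returns immediately
    new Thread("Feed Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = () // nothing to clean up in this sketch

  private def receive(): Unit =
    while (!isStopped()) {
      val lines = fetchFeedLines(feedUrl) // download one feed and split it into lines
      lines.foreach(store)                // hand each line over to Spark
    }

  private def fetchFeedLines(url: String): Iterator[String] = ???
}

// the stream of feed lines is then obtained from the streaming context:
// val feedLinesDstream = ssc.receiverStream(new FeedReceiver(feedUrl))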

Given the original stream of feedLinesDstream, we could do something like the following (in Scala; the Java version should be similar, just with considerably more lines of code):

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x # of executors

// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions { iter =>
  implicit val executionContext: ExecutionContext = ??? // obtain a thread pool for execution
  // kick off the blocking I/O calls asynchronously on the thread pool
  val futures = iter.map(elem => Future(ioOperation(elem))).toList
  // turn the collection of futures into a future collection of results
  val res = Future.sequence(futures)
  // wait for all calls in the partition to complete and return their results
  Await.result(res, timeout).iterator
}
data.saveToCassandra(keyspace, table)
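
The ??? for the execution context would typically be a thread pool created once per executor and reused across batches. A minimal sketch, assuming a fixed-size pool (the size of 16 is only an example and should be tuned against the latency of the HTTP calls and the memory available per executor):

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// holds one pool per executor JVM; created lazily and reused across batches
object IoExecutionContext {
  lazy val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))
}

Keeping the pool in a singleton object means it is instantiated on the executors rather than on the driver, and it is not captured and serialized with the closure.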

Upvotes: 3

Govinda Tamburino

Reputation: 23

Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?

When considering using Spark, you should ask yourself a few questions:

  1. What is the scale of my application in its current state, and where will it grow in the future? (Spark is generally meant for Big Data applications where millions of operations happen per second.)

  2. What is my preferred language? (Spark can be used from Java, Scala, Python, and R.)

  3. What database will I be using? (Technologies like Apache Spark are normally paired with large-scale data stores like HBase.)

Also I'm not sure how Spark-streaming handles blocking IO tasks.

There is already an answer on Stack Overflow about blocking IO tasks using Spark in Scala. It should give you a start, but to answer that question, yes it is possible.

Lastly, reading the documentation is important, and Spark's documentation is a good place to start.

Upvotes: 1
