Developing a Spark Streaming application

Question

The problem I'm trying to tackle is the following:

I need a data source that emits messages at a certain frequency
There are N neural nets that need to process each message individually
The outputs from all neural nets are aggregated, and a message should be declared fully processed only once all N of its outputs have been collected
Finally, I need to measure how long it took for each message to be fully processed (the time between when it was emitted and when all N neural net outputs for that message have been collected)
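To make the last two requirements concrete, here is a minimal sketch of the bookkeeping in plain Java, without any Spark code. It is not my actual aggregator; the class name CompletionTracker, the numeric message id, and the explicit timestamps are all assumptions made for illustration. It also assumes each net reports exactly once per message.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts outputs per message and reports the latency
// once all N outputs for a message have arrived.
final class CompletionTracker {
    private final int numberOfNets;
    // message id -> emit time in milliseconds
    private final Map<Long, Long> emitTimes = new HashMap<>();
    // message id -> number of neural net outputs collected so far
    private final Map<Long, Integer> counts = new HashMap<>();

    CompletionTracker(int numberOfNets) {
        this.numberOfNets = numberOfNets;
    }

    // Record the moment the source emitted the message.
    synchronized void emitted(long messageId, long emitTimeMillis) {
        emitTimes.put(messageId, emitTimeMillis);
        counts.put(messageId, 0);
    }

    // Record one neural net output. Returns the end-to-end latency in
    // milliseconds if this output completes the message, otherwise -1.
    synchronized long outputCollected(long messageId, long nowMillis) {
        Integer seen = counts.get(messageId);
        if (seen == null) {
            return -1; // unknown message, or already fully processed
        }
        seen++;
        if (seen < numberOfNets) {
            counts.put(messageId, seen);
            return -1;
        }
        // All N outputs are in: drop the bookkeeping and report latency.
        counts.remove(messageId);
        return nowMillis - emitTimes.remove(messageId);
    }
}
```

With numberOfNets = 3, the first two calls to outputCollected for a message return -1 and the third returns the elapsed time since emitted was called for that message. In a cluster the timestamps would come from different machines, so clock skew would affect the measurement unless both are taken on the same host.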

I'm curious as to how one would approach such a task using Spark Streaming.

My current implementation uses 3 types of components: a custom receiver and two classes that implement Function, one for the neural nets, one for the end aggregator.

In broad strokes, my application is built as follows:

JavaReceiverInputDStream<...> rndLists = jssc.receiverStream(new JavaRandomReceiver(...));

Function<JavaRDD<...>, Void> aggregator = new JavaSyncBarrier(numberOfNets);

for(int i = 0; i < numberOfNets; i++){
    rndLists.map(new NeuralNetMapper(neuralNetConfig)).foreachRDD(aggregator);
}

The main problem I'm having with this, though, is that it runs faster in local mode than when submitted to a 4-node cluster.

Is my implementation wrong to begin with, or is something else happening here?

There's also a full post here http://apache-spark-user-list.1001560.n3.nabble.com/Developing-a-spark-streaming-application-td12893.html with more details regarding the implementation of each of the three components mentioned previously.
