arkrad87

Reputation: 196

Code reuse between spark streaming and batch mode for distinct elements

I'm a Spark newbie and I want to implement a lambda architecture using spark streaming and spark batch.

Looking on the web I found the following article:

http://blog.cloudera.com/blog/2014/08/building-lambda-architecture-with-spark-streaming/

This works for some of my analyses, but I don't think that approach is feasible when distinct elements have to be found.

If you want to find the distinct elements of a JavaRDD you can use the distinct method. A DStream is a sequence of RDDs, so if you apply

transform((rdd) -> rdd.distinct()) 

to the DStream, distinct is executed on each RDD of the stream separately: you find the distinct elements within each RDD, not across the whole DStream.

Written like that it may be a bit confusing, so let me clarify with an example:

I have the following elements:

Apple
Pear
Banana
Peach
Apple
Pear

In the batch app:

JavaRDD<String> elemsRDD = sc.textFile(exFilePath).distinct();

The child RDD will contain:

Apple
Pear
Banana
Peach

If I have understood correctly, this should be the behaviour for the stream. Suppose we have a batch time of 1s and a window of 2s:

First RDD:

Apple  
Pear
Banana

Second RDD:

Peach
Apple
Pear

JavaDStream<String> elemsStream = (getting from whatever source)
JavaDStream<String> childStream = elemsStream.transform((rdd) -> rdd.distinct());
childStream.foreachRDD(...)

will end up with 2 RDDs.

First:

Apple  
Pear
Banana

Second:

Peach
Apple
Pear

These are distinct with respect to each RDD, but not with respect to the whole DStream.
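The behaviour above can be sketched without Spark, using plain Java collections to stand in for the two micro-batches (the class and method names here are hypothetical, for illustration only):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of the difference described above:
// deduplicating each micro-batch separately vs. across all batches.
public class PerBatchDistinct {

    // distinct within a single batch, like rdd.distinct() on one RDD
    public static Set<String> distinctPerBatch(List<String> batch) {
        return new LinkedHashSet<>(batch);
    }

    // distinct across all batches, like the batch-mode result
    public static Set<String> distinctAcrossBatches(List<List<String>> batches) {
        Set<String> all = new LinkedHashSet<>();
        for (List<String> batch : batches) {
            all.addAll(batch);
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("Apple", "Pear", "Banana");
        List<String> second = Arrays.asList("Peach", "Apple", "Pear");

        // per-batch distinct keeps "Apple" and "Pear" in both results
        System.out.println(distinctPerBatch(first));   // [Apple, Pear, Banana]
        System.out.println(distinctPerBatch(second));  // [Peach, Apple, Pear]

        // global distinct removes the cross-batch duplicates
        System.out.println(distinctAcrossBatches(Arrays.asList(first, second)));
        // [Apple, Pear, Banana, Peach]
    }
}
```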

My solution for the Streaming part was the following:

JavaDStream<HashSet<String>> distinctElems = elemsStream.map(
                (elem) -> {
                    // wrap each element in a singleton set
                    HashSet<String> htSet = new HashSet<String>();
                    htSet.add(elem);
                    return htSet;
                }).reduce((sp1, sp2) -> {
                    // union the sets, deduplicating across the stream
                    sp1.addAll(sp2);
                    return sp1;
                });

In this way the result is:

Apple
Pear
Banana
Peach

as in batch mode. However, this solution requires maintenance overhead and carries a risk of errors resulting from the duplicated code bases.

Is there a better way to reach the same outcome while reusing as much as possible of the batch-mode code?

Thanks in advance.

Upvotes: 4

Views: 1174

Answers (1)

fhuertas

Reputation: 5473

Your solution is elegant.

I have another solution. It is less elegant than yours, and I don't know if it is more efficient. This is my solution, based on mapToPair:

JavaPairDStream<String, Integer> distinctElems = elemsStream
       .mapToPair(event -> new Tuple2<String, Integer>(event,1));
distinctElems = distinctElems.reduceByKey((t1, t2) -> t1);

I think it is more efficient, but I cannot test it.
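The idea can be simulated without Spark using a map whose merge function keeps the existing value on a key collision, mirroring `reduceByKey((t1, t2) -> t1)`; the remaining keys are the distinct elements (`PairDistinct` is a hypothetical name for this sketch):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of the mapToPair + reduceByKey idea:
// pair every element with 1, merge pairs by key keeping the first value,
// and read the distinct elements off the key set.
public class PairDistinct {

    public static Set<String> distinct(List<String> elems) {
        Map<String, Integer> pairs = new LinkedHashMap<>();
        for (String elem : elems) {
            // like reduceByKey((t1, t2) -> t1): on collision keep the old value
            pairs.merge(elem, 1, (t1, t2) -> t1);
        }
        return pairs.keySet();
    }

    public static void main(String[] args) {
        List<String> elems =
                Arrays.asList("Apple", "Pear", "Banana", "Peach", "Apple", "Pear");
        System.out.println(distinct(elems)); // [Apple, Pear, Banana, Peach]
    }
}
```

Compared with the set-union reduce, this shuffles one small pair per element and merges values per key instead of building and merging whole sets, which is the intuition behind the efficiency claim.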

Upvotes: 1
