lee

Reputation: 159

How can I filter elements from a Spark RDD where the difference between adjacent elements is greater than a threshold?

I have a problem in Spark with Scala: I want to get each element whose difference from the preceding element is greater than a threshold. I create an RDD like this:

  [2,3,5,8,19,3,5,89,20,17]

I want to subtract each pair of adjacent elements like this:

 a.apply(1)-a.apply(0), a.apply(2)-a.apply(1), …… a.apply(a.length-1)-a.apply(a.length-2)

If a result is greater than the threshold of 10, then output the second element of the pair, like this:

[19,89]

How can I do this in Scala from an RDD?
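To make the intent concrete, here is the same operation on a plain Scala collection (ignoring Spark for a moment):

```scala
val a = Seq(2, 3, 5, 8, 19, 3, 5, 89, 20, 17)

// keep the second element of every adjacent pair whose difference exceeds 10
val result = a.sliding(2).collect {
  case Seq(prev, cur) if cur - prev > 10 => cur
}.toList

println(result)  // List(19, 89)
```

The question is how to express this pairing of adjacent elements on a distributed RDD.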

Upvotes: 1

Views: 167

Answers (2)

koiralo

Reputation: 23119

You can pair each element of the RDD with the element that follows it, producing tuples like (2,3), (3,5), (5,8), and then keep the second element of every pair whose difference is greater than 10.

val rdd = spark.sparkContext.parallelize(Seq(2, 3, 5, 8, 19, 3, 5, 89, 20, 17))

// RDD.zip requires the same number of elements in each partition, so zipping
// the RDD with a filtered copy of itself is not reliable; instead, index each
// element and join it with its successor
val indexed = rdd.zipWithIndex.map { case (v, i) => (i, v) }
val successors = indexed.map { case (i, v) => (i - 1, v) }

indexed.join(successors)                          // (index, (current, next))
  .map { case (_, (cur, next)) => (next - cur, next) }
  .filter { case (diff, _) => diff > 10 }
  .map { case (_, next) => next }
  .foreach(println)

Hope this helps!

Upvotes: 0

Ramesh Maharjan

Reputation: 41987

If you have data as

val data = Seq(2,3,5,8,19,3,5,89,20,17)

you can create rdd as

val rdd = sc.parallelize(data)

What you desire can be achieved by doing the following

import org.apache.spark.mllib.rdd.RDDFunctions._

val finalrdd = rdd
  .sliding(2)                      // adjacent pairs: Array(2,3), Array(3,5), …
  .map(x => (x(1), x(1) - x(0)))   // (second element, difference)
  .filter(y => y._2 > 10)
  .map(z => z._1)

Doing

finalrdd.foreach(println)

should print

19
89

Upvotes: 1
