monster

Reputation: 1782

Apache Spark (Scala) - print 1 entry of an RDD / pairRDD

When using an RDD I have grouped the items within the RDD by Key.

    val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey

pairRDD is of type: RDD[(Int, Iterable[Int])]

What I am having trouble with is simply accessing a particular element. What is the point of having a key if I seemingly can't access an item in the RDD by that key?

At the moment I filter the RDD down to a single item, but the result is still an RDD, so I have to call foreach on it to print it out:

    val userNumber10 = pairRDD.filter(_._1 == 10)
    userNumber10.foreach(x => println("user number = " + x._1))

Alternatively, I can filter the RDD and then take(1) which returns an Array of size 1:

    val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)

Or I can go one step further and select the first element of that returned array:

    val userNumber10Pair = pairRDD.filter(_._1 == 10).take(1)(0)

Which returns me the pair as required. But... clearly, this is inconvenient and I would hazard a guess at saying that this is not how an RDD is meant to be used!

Why am I doing this you may ask! Well, the reason it's come about is because I simply wanted to "see" what was in my RDD for my own testing purposes. So, is there a way to access individual items in an RDD (more strictly a pairRDD) and if so, how? If not, what is the purpose of a pairRDD?

Upvotes: 1

Views: 3287

Answers (1)

ale64bit

Reputation: 6242

Use the lookup function, which belongs to PairRDDFunctions. From the official documentation:

Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

https://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/PairRDDFunctions.html
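As a rough sketch of how lookup could be applied to the pairRDD from the question: the sample data, the local master setting, and the helper name productsFor are all made up for illustration, and a reasonably recent Spark version is assumed (older releases may also need import org.apache.spark.SparkContext._ for the pair functions). Because the values were grouped, lookup hands back a Seq containing one Iterable per matching key, so the helper flattens it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LookupExample {
  // Hypothetical helper: all products recorded for `user`,
  // or an empty Seq when the key is not present in the RDD.
  def productsFor(pairRDD: RDD[(Int, Iterable[Int])], user: Int): Seq[Int] =
    pairRDD.lookup(user).flatten

  def main(args: Array[String]): Unit = {
    // Local context for experimenting; a real job would take its master from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("lookup-example").setMaster("local[*]"))
    try {
      // Made-up (user, product) pairs standing in for oldRDD.map(x => (x.user, x.product))
      val pairRDD = sc.parallelize(Seq((10, 1), (10, 2), (11, 3))).groupByKey()

      println("user 10 -> " + productsFor(pairRDD, 10).mkString(", "))
      println("user 99 -> " + productsFor(pairRDD, 99).mkString(", ")) // no such key: empty
    } finally {
      sc.stop()
    }
  }
}
```

The order of the values inside a group is not guaranteed, so sort them if you need a stable printout.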

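And if you just want to see the contents of your RDD, you can simply call collect. Keep in mind that collect copies the entire RDD to the driver, so it is only suitable for small data; for a quick peek at a large RDD, take(n) or first is safer.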
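A minimal sketch of inspecting an RDD on the driver, again with made-up sample data, a local master, and a hypothetical render helper used only for formatting. It assumes the RDD is small enough that pulling it to the driver is acceptable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InspectExample {
  // Hypothetical formatter for one (user, products) entry.
  def render(entry: (Int, Iterable[Int])): String =
    "user " + entry._1 + " -> " + entry._2.mkString(", ")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("inspect-example").setMaster("local[*]"))
    try {
      val pairRDD = sc.parallelize(Seq((10, 1), (10, 2), (11, 3))).groupByKey()

      // collect() returns every entry as a local Array: fine for small test data only.
      pairRDD.collect().foreach(entry => println(render(entry)))

      // take(n) brings back at most n entries, so it stays cheap on large RDDs.
      pairRDD.take(1).foreach(entry => println(render(entry)))

      // first() returns a single entry directly (it throws on an empty RDD).
      println(render(pairRDD.first()))
    } finally {
      sc.stop()
    }
  }
}
```

Printing from the collected Array rather than calling foreach on the RDD itself means the output appears on the driver, which is usually what you want when checking data by eye.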

Upvotes: 4
