Reputation: 1782
When using an RDD, I have grouped the items within the RDD by key.
val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey
pairRDD is of type RDD[(Int, Iterable[Int])].
What I am having trouble with is simply accessing a particular element. What is the point of having a key when I seemingly can't access the item in the RDD by key?
At the minute I filter the RDD down to a single item, however I still have an RDD, and as such I have to do a foreach on the RDD to print it out:
val userNumber10 = pairRDD.filter(_._1 == 10)
userNumber10.foreach(x => println("user number = " + x._1))
Alternatively, I can filter the RDD and then take(1), which returns an Array of size 1:
val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)
Or I can select the first element of that returned array:
val userNumber10Pair = pairRDD.filter(_._1 == 10).take(1)(0)
This returns the pair as required. But clearly this is inconvenient, and I would hazard a guess that this is not how an RDD is meant to be used!
Why am I doing this, you may ask? Well, it came about because I simply wanted to "see" what was in my RDD for my own testing purposes. So, is there a way to access individual items in an RDD (more strictly, a pairRDD), and if so, how? If not, what is the purpose of a pairRDD?
Upvotes: 1
Views: 3287
Reputation: 6242
Use the lookup function, which belongs to PairRDDFunctions. From the official documentation:
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
https://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/PairRDDFunctions.html
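As a rough sketch of how lookup could be applied to the grouped RDD from the question: the Purchase case class, the sample data, the app name, and the local master below are all made up for illustration, and the code assumes Spark 1.3 or later (older versions would also need import org.apache.spark.SparkContext._ to bring the pair-RDD operations into scope).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type standing in for the elements of oldRDD.
case class Purchase(user: Int, product: Int)

object LookupSketch {
  // Same grouping as in the question: user id -> all product ids for that user.
  def groupByUser(oldRDD: RDD[Purchase]): RDD[(Int, Iterable[Int])] =
    oldRDD.map(x => (x.user, x.product)).groupByKey()

  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for a quick test run.
    val sc = new SparkContext(
      new SparkConf().setAppName("lookup-sketch").setMaster("local[*]"))
    try {
      // Made-up sample data.
      val oldRDD = sc.parallelize(Seq(Purchase(10, 1), Purchase(10, 2), Purchase(11, 3)))
      val pairRDD = groupByUser(oldRDD)

      // lookup returns a local Seq with one entry per record stored under the key.
      // After groupByKey there is one record per key, so the Seq holds one Iterable.
      val user10: Seq[Iterable[Int]] = pairRDD.lookup(10)
      user10.foreach(products =>
        println("user number = 10, products = " + products.mkString(", ")))

      // A key that is not present yields an empty Seq.
      println("entries for user 99: " + pairRDD.lookup(99).size)
    } finally {
      sc.stop()
    }
  }
}
```

Because lookup returns an ordinary Seq on the driver, no foreach on an RDD is needed to print the result.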
And if you just want to see the contents of your RDD, you simply call collect.
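A minimal sketch of inspecting an RDD with collect, under the same assumptions as before (made-up sample data, a placeholder app name and local master, Spark 1.3 or later). The describe helper is hypothetical and only formats the pairs. Note that collect copies the entire RDD to the driver, so it is only suitable for small test data; take(n) is a lighter alternative when a few elements are enough.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InspectSketch {
  // Hypothetical helper: turns collected pairs into printable lines.
  def describe(pairs: Seq[(Int, Iterable[Int])]): Seq[String] =
    pairs.map { case (user, products) =>
      "user number = " + user + ", products = " + products.mkString(", ")
    }

  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for a quick test run.
    val sc = new SparkContext(
      new SparkConf().setAppName("inspect-sketch").setMaster("local[*]"))
    try {
      // Made-up (user, product) pairs, grouped by user as in the question.
      val pairRDD = sc.parallelize(Seq((10, 1), (10, 2), (11, 3))).groupByKey()

      // collect() returns every element as a local Array on the driver.
      describe(pairRDD.collect().toSeq).foreach(println)

      // take(n) fetches only the first n elements instead of the whole RDD.
      describe(pairRDD.take(1).toSeq).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```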
Upvotes: 4