Reputation: 464
I loaded data from HBase, ran some operations on it, and ended up with a pair RDD. I want to use the data in this RDD in my next function. The RDD holds about half a million records. Can you suggest a performant way to read data by key from the pair RDD?
Upvotes: 0
Views: 7909
Reputation: 33
Do the following:
rdd2 = rdd1.sortByKey()
rdd2.lookup(key)
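This will be fast because sortByKey gives rdd2 a range partitioner, so lookup only searches the partition the key maps to instead of scanning the whole RDD. If you plan to do several lookups, consider caching rdd2.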
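To illustrate why a known partitioner helps, here is a rough plain-Python model of the idea. It is not Spark code: `range_partition` and `lookup` are hypothetical helpers written for this sketch, and the real partitioner picks its boundaries by sampling rather than by sorting every key.

```python
from bisect import bisect_left

def range_partition(pairs, num_partitions):
    """Split (key, value) pairs into contiguous key ranges (toy model)."""
    keys = sorted({k for k, _ in pairs})
    step = max(1, -(-len(keys) // num_partitions))  # ceiling division
    # Upper-bound key of every partition except the last one.
    bounds = keys[step - 1:-1:step]
    partitions = [[] for _ in range(len(bounds) + 1)]
    for k, v in pairs:
        # Every occurrence of a key lands in the same partition.
        partitions[bisect_left(bounds, k)].append((k, v))
    return partitions, bounds

def lookup(partitions, bounds, key):
    """Return all values for key, scanning only the partition that can hold it."""
    idx = bisect_left(bounds, key)
    return [v for k, v in partitions[idx] if k == key]

pairs = [("b", 2), ("a", 1), ("c", 3), ("a", 4), ("d", 5)]
partitions, bounds = range_partition(pairs, 2)
print(lookup(partitions, bounds, "a"))
```

Without the boundaries, finding a key means checking every partition. With them, a binary search over the boundaries picks the one partition worth scanning.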
Upvotes: 1
Reputation: 931
You can use rdd.lookup(key)
to return all values associated with the given key. Note that lookup is an action, so you can call it only from the driver, not from inside another transformation.
Upvotes: 1
Reputation: 528
That is a tough use case. Can you use some datastore and index it?
Check out Splice Machine (Open Source).
Upvotes: 1
Reputation: 762
You can use
rddName.take(5)
where 5 is the number of elements to return from the start of the RDD. Change the number as needed. To read only the first element, you can use
rddName.first
Upvotes: 0