user1742188

Reputation: 5133

How to get nth row of Spark RDD?

Suppose I have an RDD of arbitrary objects and I want to get the 10th (say) row of the RDD. How would I do that? One way is to use rdd.take(n) and then access the nth element of the result, but this approach is slow when n is large.

Upvotes: 12

Views: 34553

Answers (3)

Neeraj Mehta

Reputation: 69

RDD.collect() and RDD.take(x) both return a list, which supports indexing. So whenever we need the element at position N, either of the following works: RDD.collect()[N-1] or RDD.take(N)[N-1]. Note that collect() brings the whole RDD to the driver, so take(N) is the cheaper of the two.
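As a minimal sketch of the take(N)[N-1] idea in PySpark: the helper name nth_via_take is made up for illustration, and the only thing assumed about rdd is that it has the standard take() method. The length check is an addition, since plain indexing would fail with an unhelpful error when the RDD holds fewer than N elements.

```python
def nth_via_take(rdd, n):
    """Return the n-th element (1-based) of rdd.

    Hypothetical helper: it pulls the first n elements to the driver,
    so the cost grows with n.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    first_n = rdd.take(n)  # a Python list with at most n elements
    if len(first_n) < n:
        raise IndexError(
            f"RDD has only {len(first_n)} elements, asked for element {n}"
        )
    return first_n[n - 1]
```

This keeps the approach from the answer unchanged; it just makes the out-of-range case explicit.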

Upvotes: 6

Jack Daniel

Reputation: 2611

I haven't tested this on large data, but it works fine for me.

Let's say n = 2 and I want to access the 2nd element:

   data.take(2).drop(1)

Upvotes: 10

Nicola Ferraro

Reputation: 4189

I don't know how efficient it is, as that depends on current and future optimizations in Spark's engine, but you can try the following:

rdd.zipWithIndex.filter(_._2==9).map(_._1).first()

The first function transforms the RDD into an RDD of (value, idx) pairs, with idx starting from 0. The second function keeps only the element with idx == 9 (the 10th). The third function extracts the original value. Finally, first() returns the result.
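The same three steps can be sketched in PySpark. The helper name nth_via_index is hypothetical; it relies only on the standard RDD methods zipWithIndex(), filter(), map() and take(). It uses take(1) instead of first() so that an out-of-range index can be reported explicitly; that choice is an assumption of this sketch, not part of the snippet above.

```python
def nth_via_index(rdd, n):
    """Return the element at 0-based position n of rdd.

    Hypothetical helper mirroring the zipWithIndex / filter / map steps.
    Only the single matching element is sent back to the driver.
    """
    matches = (
        rdd.zipWithIndex()                      # (value, idx) pairs, idx from 0
        .filter(lambda pair: pair[1] == n)      # keep the pair whose idx is n
        .map(lambda pair: pair[0])              # drop the index, keep the value
        .take(1)                                # list with 0 or 1 elements
    )
    if not matches:
        raise IndexError(f"no element at index {n}")
    return matches[0]
```

Assuming an existing SparkContext named sc, a call such as nth_via_index(sc.parallelize(["a", "b", "c"]), 1) should give "b".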

The first function could be pulled up by the execution engine and influence the behavior of the whole processing. Give it a try.

In any case, if n is very large, this method is efficient in that it does not require collecting an array of the first n elements on the driver node.

Upvotes: 15
