Reputation: 45

last element in RDD using python spark

I am trying to get the last element information from a Spark RDD. I have sorted the RDD with respective the value of a (key, value) pair.

My data in RDD

(8, 0.98772733936789858)

(4, 3.0599761935471004)

(2, 3.1913934060593321)

(1, 4.9646263295153013)

(5, 5.3596802463208792)

(7, 5.5829277439661071)

(9, 6.4739040233992258)

(0, 6.9343681509951081)

(6, 7.4699692671955953)

(3, 8.6579764626088771)

I am able to get the first (key, value) pair using the first function, but not able to figure out how to get the last one. I can do a swap of (key, value) pair to (value, key) pair and get the required data using .max function. However, is there any other way to get the last element from a RDD using Python spark?

Upvotes: 1

Answers (2)

sparedes

Reputation: 91

Yes, there are other ways.

Here are a few (including yours) along with a very informal ranking of performance based on 1000 tests per method with one local worker thread on my machine -- using the dataset you provided in the question.

1) max()

Find the maximum item in this RDD.

output = (
          rdd.map(lambda (a, b): (b, a))
          .max()
         )

This was the 1st fastest on average.

2) sortByKey() -- with first()

Sorts this RDD, which is assumed to consist of (key, value) pairs.

Return the first element in this RDD.

output = (
          rdd.map(lambda (a, b): (b, a))
          .sortByKey(ascending=False)
          .first()
         )

This was the 4th fastest on average.

3) top()

Get the top N elements from a RDD.

output = (
          rdd.map(lambda (a, b): (b, a))
          .top(1)
         )

This was the 3rd fastest on average

4) top() -- with comparison key

Get the top N elements from a RDD.

output = (
          rdd.top(1, key=lambda x: x[1])
         )

This was the 2nd fastest on average.

Final thoughts

You'll notice that the 4th method doesn't swap the (key/value)'s. Instead it sweeps over the rdd with key ('key' the argument -- not part of your rdd) specifying a function of one argument that is used to extract a comparison key from each element in the iterable, and in this case the comparison key is the second item in the your (key, value) tuples, i.e. value.

So method 1, max(), is perfectly good. But...

Once you're in the territory where you need the 'last n elements' (i.e. more than just the last element) then I would say method 4 is the preferred way to go.

Upvotes: 4

DPM

Reputation: 1581

RDD.first() is pretty efficient because it can be executed in a short-circuit fashion. Since you're sorting the data anyway, by the second value in the tuple, sort the RDD reversed and then just take the first element.

Upvotes: 3