nikos

Reputation: 3013

PySpark - Convert an RDD into a key value pair RDD, with the values being in a List

I have an RDD with tuples being in the form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...

What I want is to transform that into a key-value pair RDD, where the key is the first string and the value is a list of the remaining strings, i.e. I want to turn it into the form:

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...

Upvotes: 4

Views: 20358

Answers (1)

B.Mr.W.

Reputation: 19648

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

Explanation of lambda x: (x[0], list(x[1:])):

  1. x[0] takes the first element of each tuple and makes it the key of the output pair
  2. x[1:] slices out every element except the first one to form the value
  3. list(x[1:]) converts that slice to a list, because slicing a tuple yields a tuple by default
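Since the lambda is plain Python, its behavior can be checked without a Spark session; a minimal sketch applying the same function to an ordinary list in place of the RDD:

```python
# Same lambda as in the answer, applied with the built-in map()
# instead of RDD.map(); no SparkContext needed.
rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]

pairs = list(map(lambda x: (x[0], list(x[1:])), rows))
# pairs == [('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
```

On a real RDD the only difference is that `map` is a Spark transformation, so nothing runs until an action such as `collect()` is called.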

Upvotes: 9
