Ace Haidrey

Reputation: 1228

Getting the first item of each tuple in a list for each row in PySpark

I'm a bit new to Spark and I am trying to do a simple mapping.
My data is like the following:

RDD((0, list(tuples)), ..., (19, list(tuples)))

What I want to do is grab the first item in each tuple, so ultimately something like this:

RDD((0, list(first item of each tuple)), ..., (19, list(first item of each tuple)))

Can someone help me out with how to map this?
I'd appreciate it!

Upvotes: 2

Views: 2163

Answers (2)

OneCricketeer

Reputation: 191983

Something like this?

Each row is a key-value pair: mapValues leaves the key alone and applies the function to the value, which here maps itemgetter(0) over the list of tuples. So, a map within a map :-)

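from operator import itemgetter

# sc is an existing SparkContext
rdd = sc.parallelize([(0, [(0,'a'), (1,'b'), (2,'c')]), (1, [(3,'x'), (5,'y'), (6,'z')])])
# list() is needed on Python 3, where map returns a lazy iterator rather than a list
mapped = rdd.mapValues(lambda v: list(map(itemgetter(0), v)))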

Output

mapped.collect()
[(0, [0, 1, 2]), (1, [3, 5, 6])]

Upvotes: 2

AChampion

Reputation: 30288

You can use mapValues to convert each list of tuples to a list holding only the first element (t[0]) of each tuple:

rdd.mapValues(lambda x: [t[0] for t in x])
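
To check the per-value logic without a Spark session, here is a minimal plain-Python sketch. It is only a local stand-in: the list `rows` plays the role of the RDD, the outer comprehension imitates what mapValues does to each (key, value) pair, and the helper name `first_items` is made up for illustration. The sample data is borrowed from the other answer.

```python
from operator import itemgetter


def first_items(tuples):
    """Return the first element of every tuple in a list (hypothetical helper)."""
    return [t[0] for t in tuples]


# Local stand-in for the RDD: a plain list of (key, list_of_tuples) pairs.
rows = [(0, [(0, 'a'), (1, 'b'), (2, 'c')]), (1, [(3, 'x'), (5, 'y'), (6, 'z')])]

# Imitates rdd.mapValues(first_items): keys pass through, values are transformed.
mapped = [(key, first_items(value)) for key, value in rows]

# Same transformation with itemgetter; list() forces the lazy map on Python 3.
mapped_itemgetter = [(key, list(map(itemgetter(0), value))) for key, value in rows]

print(mapped)
```

Assuming an RDD of the same shape, the helper can be passed directly, as in rdd.mapValues(first_items), instead of writing the lambda inline.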

Upvotes: 4
