user2200660

Reputation: 1271

pyspark processing a text file

I have a file like this, which I have read into a PythonRDD:

[(u'id1', u'11|12|13|14|15|16|17|18|,21|22|23|24|25|26|27|28|'), (u'id2', u'31|32|33|34|35|36|37|38|,41|42|43|44|45|46|47|28|')]

Representation: the RDD is a pair RDD where each key is a user id (id1, id2) and each value contains multiple records (comma-separated), each record containing multiple items (pipe-separated).

I want to transform the RDD so that each id (id1 and id2) emits as many lines as it has records, with the user id as the key and the 7th field/5th field, 6th field as the value:

id1 => 17/15, 16
id1 => 27/25, 26
id2 => 37/35, 36
id2 => 47/45, 46
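To make the format concrete, here is how a single value string decomposes in plain Python (a minimal sketch, assuming the comma/pipe separators described above):

value = u'11|12|13|14|15|16|17|18|,21|22|23|24|25|26|27|28|'
records = value.split(',')       # two records per user id
fields = records[0].split('|')   # ['11', '12', ..., '18', ''] - the trailing '|' leaves an empty last item
print(fields[6] + '/' + fields[4], fields[5])   # 17/15 16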

Any help is appreciated

Upvotes: 1

Views: 1274

Answers (1)

G Quintana

Reputation: 4667

Try something like this (flatMap is the trick):

input = [(u'id1', u'11|12|13|14|15|16|17|18|,21|22|23|24|25|26|27|28|'),
         (u'id2', u'31|32|33|34|35|36|37|38|,41|42|43|44|45|46|47|28|')]
inputRdd = sc.parallelize(input)

def splitAtPipe(value):
    # Split one record on '|' and build ("7th/5th", "6th")
    valueParts = value.split('|')
    return (valueParts[6] + "/" + valueParts[4], valueParts[5])

(inputRdd
  .flatMapValues(lambda data: data.split(","))    # one (id, record) pair per comma-separated record
  .mapValues(splitAtPipe)                         # (id, ("7th/5th", "6th"))
  .map(lambda kv: (kv[0], kv[1][0], kv[1][1]))    # flatten to (id, "7th/5th", "6th")
  .collect())

# Result
# [(u'id1', u'17/15', u'16'), (u'id1', u'27/25', u'26'), (u'id2', u'37/35', u'36'), (u'id2', u'47/45', u'46')]
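Equivalently, a single flatMap over each (id, value) pair can do all three steps in one pass (a minimal sketch of the same logic, reusing inputRdd from above):

def explode(pair):
    # Yield one (id, "7th/5th", "6th") tuple per comma-separated record
    idx, value = pair
    for record in value.split(","):
        parts = record.split('|')
        yield (idx, parts[6] + "/" + parts[4], parts[5])

inputRdd.flatMap(explode).collect()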

Upvotes: 1
