Sachin Sukumaran

Reputation: 715

PySpark - groupByKey not working

I have the below data. I want to group it by the first element. I am trying with PySpark core (NOT Spark SQL).

(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'), 
(u'THEFT', u'HZ257172', u'011', u'27'), 
(u'ASSAULT', u'HY266148', u'019', u'6'), 
(u'WEAPONS VIOLATION', u'HY299741', u'010', u'29'), 
(u'CRIM SEXUAL ASSAULT', u'HY469211', u'025', u'19'), 
(u'NARCOTICS', u'HY313819', u'016', u'11'), 
(u'NARCOTICS', u'HY215976', u'003', u'42'), 
(u'NARCOTICS', u'HY360910', u'011', u'27'), 
(u'NARCOTICS', u'HY381916', u'015', u'25') 

I tried with

file.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

but this didn't work.
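For reference, the sample rows can be loaded into an RDD of plain 4-tuples like this (a minimal sketch for experimenting; data and rdd are illustrative names, and the list is truncated to a few rows):

from pyspark import SparkContext

sc = SparkContext()

# A few of the sample rows above, loaded as an RDD of plain 4-tuples.
data = [
    (u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'),
    (u'THEFT', u'HZ257172', u'011', u'27'),
    (u'NARCOTICS', u'HY313819', u'016', u'11'),
]
rdd = sc.parallelize(data)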

Upvotes: 0

Views: 1354

Answers (2)

Sachin Sukumaran

Reputation: 715

Got this working with the below code:

from pyspark import SparkContext

sc = SparkContext()

def chicagofile(line):
    # Split the raw CSV line and pick out the fields of interest:
    # primary type, case number, district, and community area.
    sLine = line.split(",")
    cNum = sLine[1]
    cDist = sLine[11]
    cType = sLine[5]
    cCommArea = sLine[13]
    return (cType, cNum, cDist, cCommArea)

cFile = sc.textFile("/user/sachinkerala6174/inData/ChicagoCrime15/crimes2015.csv")
getFile = cFile.map(chicagofile)

# Reshape each record into a (key, value) pair so groupByKey can be applied:
# the crime type is the key, the remaining fields form the value.
mapCType = getFile.map(lambda x: (x[0], (x[1], x[2], x[3])))
grp = mapCType.groupByKey().map(lambda x: (x[0], list(x[1])))

# saveAsTextFile returns None, so there is no point assigning its result.
grp.saveAsTextFile("/user/sachinkerala6174/inData/ChicagoCrime15/res1")
print(grp.collect())
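As a side note, the groupByKey step above can also be written with mapValues, which transforms only the values of a pair RDD (a sketch of the same step; grpAlt is an illustrative name):

# Same result as the map(lambda x: (x[0], list(x[1]))) version above.
grpAlt = mapCType.groupByKey().mapValues(list)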

Upvotes: 0

user6022341

Reputation:

It shouldn't work. groupByKey can be called only on an RDD of key-value pairs (see How to determine if object is a valid key-value pair in PySpark), and a tuple of arbitrary length is not one.

Decide which value is the key and use map or keyBy first. For example:

rdd.map(lambda x: (x[0], x[1:])).groupByKey()
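Applied to the sample data from the question, the whole pipeline would look roughly like this (a sketch, assuming the tuples were loaded into an RDD named rdd as in the snippet under the question; mapValues(list) is added only to make the grouped values printable):

pairs = rdd.map(lambda x: (x[0], x[1:]))      # key = crime type, value = the remaining fields
grouped = pairs.groupByKey().mapValues(list)  # materialize each group's iterable as a list
print(grouped.collect())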

Upvotes: 3
