Sachin Sukumaran

Reputation: 715

PySpark - groupByKey not working

I have the below data. I want to group it by the first element. I am trying with PySpark core (NOT Spark SQL).

(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'), 
(u'THEFT', u'HZ257172', u'011', u'27'), 
(u'ASSAULT', u'HY266148', u'019', u'6'), 
(u'WEAPONS VIOLATION', u'HY299741', u'010', u'29'), 
(u'CRIM SEXUAL ASSAULT', u'HY469211', u'025', u'19'), 
(u'NARCOTICS', u'HY313819', u'016', u'11'), 
(u'NARCOTICS', u'HY215976', u'003', u'42'), 
(u'NARCOTICS', u'HY360910', u'011', u'27'), 
(u'NARCOTICS', u'HY381916', u'015', u'25') 

I tried with

file.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

but this didn't work.
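For reference, the sample rows can be loaded into an RDD of plain 4-tuples like this (a minimal sketch for experimenting; data and rdd are illustrative names, and the list is truncated to a few rows):

from pyspark import SparkContext

sc = SparkContext()

# A few of the sample rows above, loaded as an RDD of plain 4-tuples.
data = [
    (u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'),
    (u'THEFT', u'HZ257172', u'011', u'27'),
    (u'NARCOTICS', u'HY313819', u'016', u'11'),
]
rdd = sc.parallelize(data)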

Upvotes: 0

Views: 1354

Answers (2)

Sachin Sukumaran

Reputation: 715

Got this working with the below code:

from pyspark import SparkContext

sc = SparkContext()

def chicagofile(line):
    # Split the raw CSV line and pick out the fields of interest:
    # primary type, case number, district, and community area.
    sLine = line.split(",")
    cNum = sLine[1]
    cDist = sLine[11]
    cType = sLine[5]
    cCommArea = sLine[13]
    return (cType, cNum, cDist, cCommArea)

cFile = sc.textFile("/user/sachinkerala6174/inData/ChicagoCrime15/crimes2015.csv")
getFile = cFile.map(chicagofile)

# Reshape each record into a (key, value) pair so groupByKey can be applied:
# the crime type is the key, the remaining fields form the value.
mapCType = getFile.map(lambda x: (x[0], (x[1], x[2], x[3])))
grp = mapCType.groupByKey().map(lambda x: (x[0], list(x[1])))

# saveAsTextFile returns None, so there is no point assigning its result.
grp.saveAsTextFile("/user/sachinkerala6174/inData/ChicagoCrime15/res1")
print(grp.collect())
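As a side note, the groupByKey step above can also be written with mapValues, which transforms only the values of a pair RDD (a sketch of the same step; grpAlt is an illustrative name):

# Same result as the map(lambda x: (x[0], list(x[1]))) version above.
grpAlt = mapCType.groupByKey().mapValues(list)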

Upvotes: 0

user6022341

Reputation:

It shouldn't work. groupByKey can be called only on an RDD of key-value pairs (see How to determine if object is a valid key-value pair in PySpark), and a tuple of arbitrary length is not one.

Decide which value is the key and use map or keyBy first. For example:

rdd.map(lambda x: (x[0], x[1:])).groupByKey()
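Applied to the sample data from the question, the whole pipeline would look roughly like this (a sketch, assuming the tuples were loaded into an RDD named rdd as in the snippet under the question; mapValues(list) is added only to make the grouped values printable):

pairs = rdd.map(lambda x: (x[0], x[1:]))      # key = crime type, value = the remaining fields
grouped = pairs.groupByKey().mapValues(list)  # materialize each group's iterable as a list
print(grouped.collect())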

Upvotes: 3
