Reputation: 715
I have the below data. I want to group it by the first element. I am trying with the PySpark core API (NOT Spark SQL):
(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'),
(u'THEFT', u'HZ257172', u'011', u'27'),
(u'ASSAULT', u'HY266148', u'019', u'6'),
(u'WEAPONS VIOLATION', u'HY299741', u'010', u'29'),
(u'CRIM SEXUAL ASSAULT', u'HY469211', u'025', u'19'),
(u'NARCOTICS', u'HY313819', u'016', u'11'),
(u'NARCOTICS', u'HY215976', u'003', u'42'),
(u'NARCOTICS', u'HY360910', u'011', u'27'),
(u'NARCOTICS', u'HY381916', u'015', u'25')
I tried:
file.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
but this didn't work.
Upvotes: 0
Views: 1354
Reputation: 715
Got this working with the below code:
from pyspark import SparkContext

sc = SparkContext()

def chicagofile(line):
    sLine = line.split(",")
    cNum = sLine[1]
    cDist = sLine[11]
    cType = sLine[5]
    cCommArea = sLine[13]
    return (cType, cNum, cDist, cCommArea)

cFile = sc.textFile("/user/sachinkerala6174/inData/ChicagoCrime15/crimes2015.csv")
getFile = cFile.map(chicagofile)
mapCType = getFile.map(lambda x: (x[0], (x[1], x[2], x[3])))
grp = mapCType.groupByKey().map(lambda x: (x[0], list(x[1])))
saveFile = grp.saveAsTextFile("/user/sachinkerala6174/inData/ChicagoCrime15/res1")
print grp.collect()
Upvotes: 0
Reputation:
It shouldn't work. groupByKey
can be called only on an RDD of key-value pairs (see: How to determine if object is a valid key-value pair in PySpark), and a tuple of arbitrary length is not one.
Decide which value is the key and apply map
or keyBy
first. For example:
rdd.map(lambda x: (x[0], x[1:])).groupByKey()
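For intuition, here is a minimal plain-Python sketch of what that map-then-groupByKey pipeline produces, using a few of the sample rows from the question (no Spark cluster needed; the `defaultdict` stands in for Spark's shuffle-based grouping):

```python
from collections import defaultdict

# Sample rows from the question: (crime type, case number, district, community area)
rows = [
    ('CRIM SEXUAL ASSAULT', 'HZ256372', '003', '43'),
    ('THEFT', 'HZ257172', '011', '27'),
    ('NARCOTICS', 'HY313819', '016', '11'),
    ('NARCOTICS', 'HY215976', '003', '42'),
]

# Equivalent of rdd.map(lambda x: (x[0], x[1:])).groupByKey()
grouped = defaultdict(list)
for row in rows:
    key, value = row[0], row[1:]   # first element becomes the key
    grouped[key].append(value)

print(grouped['NARCOTICS'])
# → [('HY313819', '016', '11'), ('HY215976', '003', '42')]
```

The key point is the same in both worlds: once every record is a `(key, value)` pair, grouping collects all values that share a key.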
Upvotes: 3