Reputation: 55
auto = sc.textFile("temp/auto_data.csv")
auto = auto.map(lambda x: x.split(","))
header = auto.first()
autoData = auto.filter(lambda a: a!=header)
Now I have the data in autoData:
[[u'', u'ETZ', u'AS1', u'CUT000021', u'THE TU-WHEEL SPARES', u'DIBRUGARH', u'201505', u'LCK ', u'2WH ', u'KIT', u'KT-2069CZ', u'18', u'8484'], [u'', u'ETZ', u'AS1', u'CUT000021', u'THE TU-WHEEL SPARES', u'DIBRUGARH', u'201505', u'LCK ', u'2WH ', u'KIT', u'KT-2069SZ', u'9', u'5211']]
Now I want to perform groupBy() on the 2nd and 12th (last) values. How do I do this?
Upvotes: 4
Views: 13318
Reputation: 330353
groupBy takes as an argument a function that generates keys, so you can do something like this:
autoData.groupBy(lambda row: (row[2], row[12]))
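To see what keying on a tuple does, here is a minimal sketch in plain Python that runs without Spark. The `group_by` helper is a hypothetical local stand-in written for this illustration, not part of the Spark API, and the two rows are the ones from the question (written as ordinary Python 3 strings).

```python
from collections import defaultdict

# The two sample rows from the question (13 fields each; index 0 is empty).
rows = [
    ['', 'ETZ', 'AS1', 'CUT000021', 'THE TU-WHEEL SPARES', 'DIBRUGARH',
     '201505', 'LCK ', '2WH ', 'KIT', 'KT-2069CZ', '18', '8484'],
    ['', 'ETZ', 'AS1', 'CUT000021', 'THE TU-WHEEL SPARES', 'DIBRUGARH',
     '201505', 'LCK ', '2WH ', 'KIT', 'KT-2069SZ', '9', '5211'],
]


def group_by(records, key_func):
    """Hypothetical local stand-in for RDD.groupBy: key -> list of records."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    return dict(groups)


# Same key function as in the answer: a tuple of fields 2 and 12.
grouped = group_by(rows, lambda row: (row[2], row[12]))

for key, members in grouped.items():
    print(key, len(members))
```

Because the last field differs between the two rows, each row ends up in its own group; rows only share a group when both parts of the tuple match.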
Edit:
Regarding the task you've described in the comments: groupBy only gathers data into groups, it doesn't aggregate it. To sum the values per key, use reduceByKey instead:
from operator import add

def int_or_zero(s):
    # Fall back to 0 for fields that cannot be parsed as integers
    try:
        return int(s)
    except ValueError:
        return 0

autoData.map(lambda row: (row[2], int_or_zero(row[12]))).reduceByKey(add)
A highly inefficient version using groupByKey could look like this:
(autoData.map(lambda row: (row[2], int_or_zero(row[12])))
    .groupByKey()
    .mapValues(sum))
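Both versions produce the same totals. The difference is in how much data moves: groupByKey ships every value for a key across the shuffle before anything is summed, while reduceByKey can combine values within each partition first. The sketch below imitates the two strategies in plain Python to show they agree; `reduce_by_key` and `group_then_sum` are hypothetical helpers written for this illustration, and the `'AS2'` pair is made-up sample data added to exercise the zero fallback.

```python
from collections import defaultdict
from operator import add


def int_or_zero(s):
    # Fall back to 0 for fields that cannot be parsed as integers
    try:
        return int(s)
    except ValueError:
        return 0


# (key, value) pairs as produced by the map step; 'AS2' is made-up sample data.
pairs = [
    ('AS1', int_or_zero('8484')),
    ('AS1', int_or_zero('5211')),
    ('AS2', int_or_zero('')),
]


def reduce_by_key(pairs, func):
    """Fold values into a running result per key, like reduceByKey."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


def group_then_sum(pairs):
    """Collect every value per key first, then sum, like groupByKey + mapValues(sum)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


print(reduce_by_key(pairs, add))
print(group_then_sum(pairs))
```

The grouped variant has to hold the full list of values for each key before summing, which is the cost the reduceByKey version avoids.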
Upvotes: 2