Harish

Reputation: 3483

Apache Spark Python GroupByKey or reduceByKey or combineByKey

I am trying to process a 3 GB file. The file contains many lines, and each line carries a key at a fixed character position. Sets of lines that share the same key should be grouped together.

Here is a sample of the file structure:

abc123Key1asdas
abc124Key1asdas
abc126Key1asasd
abcw23Key2asdad
asdfsaKey2asdsa
....
.....
.....
abcasdKeynasdas
asfssdfKeynasda
asdaasdKeynsdfa

The structure I am trying to achieve is:

((Key1,(abc123Key1asdas,abc124Key1asdas,abc126Key1asasd)),(Key2,(abcw23Key2asdad,asdfsaKey2asdsa)),...,(Keyn,(abcasdKeynasdas,asfssdfKeynasda,asdaasdKeynsdfa)))

I am trying to do something like this:

lines = sc.textFile(fileName)
counts = lines.flatMap(lambda line: line.split('\n')).map(lambda line: (line[10:21],line))
output = counts.combineByKey().collect()

Can anyone help me achieve this?

Upvotes: 0

Views: 609

Answers (1)

Anchit

Reputation: 68

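Replace combineByKey() with groupByKey() and you should be fine. combineByKey() requires combiner functions as arguments, so calling it with none fails, whereas groupByKey() needs no arguments. The flatMap with split('\n') is also unnecessary, because textFile() already returns one element per line. Since groupByKey() yields an iterable per key, add mapValues(list) if you want plain lists.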

Example code

data = sc.parallelize(['abc123Key1asdas','abc123Key1asdas','abc123Key1asdas', 'abcw23Key2asdad', 'abcw23Key2asdad', 'abcasdKeynasdas', 'asfssdKeynasda', 'asdaasKeynsdfa'])
data.map(lambda line: (line[6:10],line)).groupByKey().mapValues(list).collect()

[('Key1', ['abc123Key1asdas', 'abc123Key1asdas', 'abc123Key1asdas']), ('Key2', ['abcw23Key2asdad', 'abcw23Key2asdad']), ('Keyn', ['abcasdKeynasdas', 'asfssdKeynasda', 'asdaasKeynsdfa'])]
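If you would rather keep combineByKey(), it needs three functions: one that starts an accumulator from the first value of a key, one that adds a further value within a partition, and one that merges accumulators from different partitions. The sketch below is only an illustration of that contract in plain Python, with no Spark involved. The names create_combiner, merge_value, merge_combiners and simulate_combine_by_key are made up for this example, and the two-partition split is an assumption.

```python
# Hypothetical helper names; these are the three roles combineByKey() expects.
def create_combiner(value):
    # First value seen for a key inside a partition: start a new list.
    return [value]

def merge_value(acc, value):
    # A further value for the same key inside the same partition.
    acc.append(value)
    return acc

def merge_combiners(left, right):
    # Lists for the same key that come from two different partitions.
    return left + right

def simulate_combine_by_key(partitions):
    """Local stand-in for the combineByKey contract (illustration only)."""
    per_partition = []
    for part in partitions:
        combined = {}
        for key, value in part:
            if key in combined:
                combined[key] = merge_value(combined[key], value)
            else:
                combined[key] = create_combiner(value)
        per_partition.append(combined)
    merged = {}
    for combined in per_partition:
        for key, acc in combined.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], acc)
            else:
                merged[key] = acc
    return merged

lines = ['abc123Key1asdas', 'abc124Key1asdas', 'abcw23Key2asdad',
         'asdfsaKey2asdsa', 'abc126Key1asasd']
pairs = [(line[6:10], line) for line in lines]   # key sits at positions 6..9
partitions = [pairs[:3], pairs[3:]]              # pretend there are two partitions
grouped = simulate_combine_by_key(partitions)
print(grouped)
```

With a real RDD the same three functions would be passed as data.map(lambda line: (line[6:10], line)).combineByKey(create_combiner, merge_value, merge_combiners). Building complete lists this way is unlikely to move less data than groupByKey(), so for plain grouping groupByKey() stays the simpler choice.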

More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=groupbykey#pyspark.RDD.groupByKey

Upvotes: 2
