Prabir

Reputation: 75

pyspark reduce by key not giving proper value

I have a few key-value pairs in a text file in comma-separated fashion, like (1,23), (2,25), (1,45), etc. Here 1 is the key and 23 is the value. Now in Spark I am doing a reduceByKey operation:

entries = sc.textFile("scala/test.txt")
sum = entries.map(lambda entry: (entry.split(',')[0], entry.split(',')[1])).reduceByKey(lambda val1, val2: val1 + " " + val2)

The output I am getting is:

(u'1', u'23 45')

where u'1' is the key and u'23 45' is the pair of values that were supposed to be added. This I can understand: after splitting, both the key and the value are strings, so the two values are simply being concatenated.
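As a minimal plain-Python illustration of that behavior (nothing Spark-specific assumed), + concatenates strings but adds ints:

'23' + ' ' + '45'   # strings concatenate -> '23 45'
23 + 45             # ints add -> 68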

But if I want to get them added as integers, I am doing:

sum = entries.map(lambda entry: (int(entry.split(',')[0]), int(entry.split(',')[1]))).reduceByKey(lambda val1, val2: val1 + val2)

but here I am getting an error. Please help me get rid of these u' strings.

Upvotes: 1

Views: 624

Answers (1)

vijay kumar

Reputation: 2049

Try using encode('utf8') to get rid of the u' prefix.

input:

[ramisetty@dragon1 vijay]$ cat test.txt
1,23
2,25
1,45

$SPARK_HOME/bin/pyspark

>>> entries = sc.textFile("/home/ramisetty/vijay/test.txt")
>>> sum = entries.map(lambda entry: (entry.split(',')[0].encode('utf8'), entry.split(',')[1].encode('utf8'))).reduceByKey(lambda val1, val2: int(val1) + int(val2))
>>> sum.collect()

result:

[('2', '25'), ('1', 68)]
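Note that '25' is still a string in this result: reduceByKey only calls the lambda when a key has at least two values, so key '2' (which appears once) never goes through int(). A minimal sketch of an alternative (same entries RDD assumed) that converts to int in the map step, so the value types are uniform and no encode('utf8') is needed:

>>> sums = entries.map(lambda entry: (entry.split(',')[0], int(entry.split(',')[1]))).reduceByKey(lambda v1, v2: v1 + v2)
>>> sums.collect()   # e.g. [(u'2', 25), (u'1', 68)]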

Upvotes: 1
