theMadKing

Reputation: 2074

pySpark .reduceByKey(min)/max weird behavior

I have the following code:

minTotal = numRDD.reduceByKey(min).collect()
maxTotal = numRDD.reduceByKey(max).collect()

A sample from my dataset that is acting strangely:

(18, [u'300.0', u'1000.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'1000.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0'])

The min is reported as 1000 and the max as 300.

This seems very odd to me; all my other key/value pairs report correctly except this one. I'm not sure what is going on here.

Upvotes: 0

Views: 1471

Answers (1)

theMadKing

Reputation: 2074

I forgot that the values are unicode strings, so min and max compare them as strings rather than by their numeric value. You need to convert them to float to get the correct answer.
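
For example, something along these lines should work (a minimal sketch; the SparkContext setup and sample data are illustrative, not from the original code):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical sample mirroring the problematic key from the question.
numRDD = sc.parallelize([(18, u'300.0'), (18, u'1000.0'), (18, u'300.0')])

# Cast each value to float so min/max compare numerically, not lexicographically.
floatRDD = numRDD.mapValues(float)

minTotal = floatRDD.reduceByKey(min).collect()   # [(18, 300.0)]
maxTotal = floatRDD.reduceByKey(max).collect()   # [(18, 1000.0)]

As strings, u'1000.0' sorts before u'300.0' because '1' comes before '3', which is why the min and max appeared swapped.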

Upvotes: 1
