theMadKing

Reputation: 2074

pySpark .reduceByKey(min)/max weird behavior

I have the following code:

minTotal = numRDD.reduceByKey(min).collect()
maxTotal = numRDD.reduceByKey(max).collect()

A sample from my dataset that is acting strangely:

(18, [u'300.0', u'1000.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'1000.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0', u'300.0'])

The min is reported as 1000 and the max as 300.

This seems very odd to me; all my other key/value pairs report correctly except this one. I'm not sure what is going on here.

Upvotes: 0

Views: 1471

Answers (1)

theMadKing

Reputation: 2074

I forgot that the values are unicode strings, so min and max compare them as strings rather than by their numeric value. You need to convert them to float to get the correct answer.
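
For example, something along these lines should work (a minimal sketch; the SparkContext setup and sample data are illustrative, not from the original code):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical sample mirroring the problematic key from the question.
numRDD = sc.parallelize([(18, u'300.0'), (18, u'1000.0'), (18, u'300.0')])

# Cast each value to float so min/max compare numerically, not lexicographically.
floatRDD = numRDD.mapValues(float)

minTotal = floatRDD.reduceByKey(min).collect()   # [(18, 300.0)]
maxTotal = floatRDD.reduceByKey(max).collect()   # [(18, 1000.0)]

As strings, u'1000.0' sorts before u'300.0' because '1' comes before '3', which is why the min and max appeared swapped.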

Upvotes: 1
