Reputation: 672
I have an RDD that looks like this:
rdd.take(2)
(ID, avg rating)
[(u'1269', 433355525.39999998), (u'1524', 5693044.25)] ...
I am trying to sort it with sortBy():
sorted = rdd.sortBy(lambda x: x[1])
It should return a sorted list of IDs, but I'm getting the following error instead:
ValueError: Unicode float() literal too long to convert
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
I tried converting the float value to Unicode and back, and I tried rounding it, etc.:
[(u'1269', 433355525.0), (u'1524', 5693044.0)]
Maybe using Decimal would be a solution, but I'm on Python 2.6.6 and it seems like overkill anyway.
Spark 1.6.3.
How can I fix this?
Added simple code:
lines = sc.textFile("/user/ahouskova/movies/my.data")
columns_data = lines.map(lambda line: line.split("\t"))
ratings = columns_data.map(lambda c: (c[1], (c[2], 1.0)))
movie_ratings_total_counts = ratings.reduceByKey(lambda m1, m2: (m1[0] + m2[0], m1[1] + m2[1]))
avg_ratings = movie_ratings_total_counts.mapValues(lambda total: round(float(total[0])/total[1]))
sorted_by_avg_rtg = avg_ratings.sortBy(lambda x: x[1])
Rounded:
[(u'1269', '433355525.0'), (u'1524', '5693044.0')]
String-formatted:
[(u'1269', '433355525.400'), (u'1524', '5693044.250')]
Upvotes: 1
Views: 213
Reputation: 10450
Based on the code you added to the question:
The error you reported is ValueError: Unicode float() literal too long to convert.
The problem is that your reduceByKey assumes the second element (the rating) is a float, while it is actually a string: line.split("\t") returns strings, so m1[0] + m2[0] concatenates the ratings instead of adding them, and the resulting string eventually becomes too long for float() to parse. You can cast the rating to a float when you first build the pairs.
Instead of:
ratings = columns_data.map(lambda c: (c[1], (c[2], 1.0)))
You can do:
ratings = columns_data.map(lambda c: (c[1], (float(c[2]), 1.0)))
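To see why the cast matters, here is a minimal plain-Python sketch (no Spark needed) of what the reduceByKey lambda does to string ratings versus float ratings. The sample pairs are hypothetical, standing in for the (ID, (rating, count)) tuples produced from the text file:

```python
from functools import reduce

# Hypothetical (ID, (rating, count)) pairs as parsed from the file;
# the rating field is a string because split("\t") returns strings.
ratings = [("1269", ("4.5", 1.0)), ("1269", ("3.0", 1.0))]

# With string ratings, m1[0] + m2[0] concatenates instead of adding:
merged = reduce(lambda m1, m2: (m1[0] + m2[0], m1[1] + m2[1]),
                [v for _, v in ratings])
print(merged)  # ('4.53.0', 2.0) -- a growing string, not a numeric sum

# Casting the rating to float first gives the intended sum:
fixed = reduce(lambda m1, m2: (m1[0] + m2[0], m1[1] + m2[1]),
               [(float(r), c) for r, c in (v for _, v in ratings)])
print(fixed)  # (7.5, 2.0)
```

With many ratings per movie, the concatenated string grows until float() on it (in your mapValues step) fails with exactly the "literal too long to convert" error.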
Upvotes: 0