la_femme_it

Reputation: 672

Spark Python: sortBy causes ValueError: Unicode float() literal too long to convert

I have an RDD that looks like this:

rdd.take(2)

(ID, avg rating)

[(u'1269', 433355525.39999998), (u'1524', 5693044.25)] ...

I am trying to sort it with sortBy():

sorted = rdd.sortBy(lambda x: x[1])

It should return a sorted list of IDs. Instead, I get the following error:

ValueError: Unicode float() literal too long to convert

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

I tried converting the float value to Unicode and back. I also tried rounding it, etc.

[(u'1269', 433355525.0), (u'1524', 5693044.0)]

Maybe using a Decimal would be a solution, but I'm using Python 2.6.6, and it seems like overkill to me anyway.

Spark 1.6.3.

How can I fix this?

Added simple code:

lines = sc.textFile("/user/ahouskova/movies/my.data")
columns_data = lines.map(lambda line: line.split("\t"))
ratings = columns_data.map(lambda c: (c[1], (c[2], 1.0)))
movie_ratings_total_counts = ratings.reduceByKey(lambda m1, m2: (m1[0] + m2[0], m1[1] + m2[1]))
avg_ratings = movie_ratings_total_counts.mapValues(lambda total: round(float(total[0])/total[1]))
sorted_by_avg_rtg = avg_ratings.sortBy(lambda x: x[1])

rounded

[(u'1269', '433355525.0'), (u'1524', '5693044.0')]

string formatted

[(u'1269', '433355525.400'), (u'1524', '5693044.250')]


Upvotes: 1

Views: 213

Answers (1)

Yaron

Reputation: 10450

This answer is based on the code you added to the question.

The error you provided is: ValueError: Unicode float() literal too long to convert

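The problem seems to be that the code is:

  • reading each line as a string
  • splitting it by "\t" (note: every field is still a string)
  • performing reduceByKey as if the second element were a float, while it is actually a string, so m1[0] + m2[0] concatenates the values instead of adding them, and float() later receives a very long string.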
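Here is a minimal sketch of that failure mode in plain Python, no Spark needed. The sample ratings and the helper name `combine` are made up for illustration; `combine` simply mirrors the lambda passed to reduceByKey in the question.

```python
from functools import reduce

# Same logic as the lambda passed to reduceByKey in the question.
def combine(m1, m2):
    return (m1[0] + m2[0], m1[1] + m2[1])

# Hypothetical ratings for one movie, still strings as produced by split("\t").
string_pairs = [("3", 1.0), ("4", 1.0), ("5", 1.0)]
string_total = reduce(combine, string_pairs)  # "+" concatenates the strings

# The same ratings cast to float before combining.
float_pairs = [(float(r), c) for r, c in string_pairs]
float_total = reduce(combine, float_pairs)    # "+" adds the numbers

print(string_total)
print(float_total)
```

With many ratings per movie, the concatenated string keeps growing. That would explain both the implausibly large "averages" in your output and float() eventually rejecting the literal as too long.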

You can cast the second element to float in this line:

Instead of:

ratings = columns_data.map(lambda c: (c[1], (c[2], 1.0))) 

You can do:

ratings = columns_data.map(lambda c: (c[1], (float(c[2]), 1.0))) 
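For context, here is a rough sketch of the whole computation with the cast applied, written as plain Python so it can be checked without a cluster. The function names, the sample rows, and the assumed column layout (user ID, movie ID, rating, timestamp) are hypothetical; only the positions c[1] and c[2] come from your code.

```python
from collections import defaultdict

def parse_rating(line):
    """Split a tab-separated line into (movie_id, (rating, 1.0))."""
    c = line.split("\t")
    return (c[1], (float(c[2]), 1.0))  # cast here, before any summing

def average_ratings(lines):
    """Plain-Python stand-in for reduceByKey + mapValues + sortBy."""
    totals = defaultdict(lambda: (0.0, 0.0))
    for movie_id, (rating, count) in map(parse_rating, lines):
        total, n = totals[movie_id]
        totals[movie_id] = (total + rating, n + count)
    averages = [(m, total / n) for m, (total, n) in totals.items()]
    return sorted(averages, key=lambda x: x[1])

# Made-up sample rows; the column layout is an assumption.
sample = ["1\t1269\t4\t0", "2\t1269\t3\t0", "3\t1524\t5\t0"]
result = average_ratings(sample)
print(result)
```

In the Spark job itself, only the map step needs to change. The stand-in above is just a way to confirm that the averages come out as ordinary floats once the cast is in place.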

Upvotes: 0
