Reputation: 651
I'm encountering a Spark job that quits with an "empty collection" error message.
java.lang.UnsupportedOperationException: empty collection
I have narrowed it down to the two lines that cause the issue.
sum_attribute1 = inputRDD.map(_.attribute1).reduce(_+_)
sum_attribute2 = inputRDD.map(_.attribute2).reduce(_+_)
Other lines that do .map and .distinct.count are fine. I would like to print out inputRDD.map(_.attribute1) and inputRDD.map(_.attribute2) to see what was mapped before the reduce.
I thought I could define something like
sum_attribute1 = inputRDD.map(_.attribute1)
but when I tried to compile the code, it showed errors:
[error] found : org.apache.spark.rdd.RDD[Int]
[error] required: Long
[error] sum_attribute1 = inputRDD.map(_.attribute1)
[error] ^
My attribute1 is defined as an Int, but when I tried to define it as a Long, it gave me another error.
Am I going in the right direction? How can I print the data after the map and before the reduce? What could be causing the empty collection issue? What do the underscores in _.attribute1 and reduce(_+_) mean?
Upvotes: 0
Views: 384
Reputation: 3514
I don't think you are going in the right direction; I would focus on the points below:
I recommend that you learn a bit of Scala first. For one of your specific questions, read about that usage of _.
For another of your questions: reduce cannot be used on an empty collection, so I recommend using fold instead, as it supports empty collections just fine.
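For what it's worth, here is a minimal sketch of the fold approach. The Record case class and the field names are assumptions standing in for whatever your inputRDD actually contains, not your real schema:

```scala
import org.apache.spark.sql.SparkSession

object SumAttributes {
  // Hypothetical record type; replace with whatever inputRDD really holds.
  case class Record(attribute1: Int, attribute2: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sum-attributes").getOrCreate()
    val sc = spark.sparkContext

    // An RDD that may well be empty after upstream filtering.
    val inputRDD = sc.parallelize(Seq.empty[Record])

    // reduce(_ + _) throws UnsupportedOperationException: empty collection
    // when the RDD has no elements:
    // val sumAttribute1 = inputRDD.map(_.attribute1).reduce(_ + _)

    // fold takes a zero element and simply returns it when there is nothing to add.
    val sumAttribute1 = inputRDD.map(_.attribute1).fold(0)(_ + _)
    val sumAttribute2 = inputRDD.map(_.attribute2).fold(0)(_ + _)

    // _.attribute1 is shorthand for (r => r.attribute1),
    // and _ + _ is shorthand for ((a, b) => a + b).
    println(s"sum1 = $sumAttribute1, sum2 = $sumAttribute2")

    spark.stop()
  }
}
```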
Upvotes: 1