Reputation: 31
I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:
(filename, List[Results])
Since the files are broken into several parts, the filename is the same key for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I run count() on the reduced RDD, it returns 0:
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB)
reducedResults.count() // 0
I've tried changing the key it uses with no success. Even with extremely simple attempts to group the results I don't get any output.
val singleGroup = my_rdd.groupBy { case (k, v) => 1 }
singleGroup.count() // 0
On the other hand, if I simply collect the results, then I can group them outside of Spark and everything works fine. However, I still have additional processing that I need to do on the collected results, so that isn't a good option.
What could cause the groupBy/reduceBy commands to return empty RDDs if the initial RDD isn't empty?
Upvotes: 2
Views: 651
Reputation: 31
Turns out there was a bug in how I was generating the Spark configuration for that particular job. Instead of setting the spark.default.parallelism field to something reasonable, it was being set to 0.
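For illustration only, here is a minimal sketch of the kind of misconfiguration involved and the fix; the SparkConf-building code, app name, and stale variable are assumptions, not the actual job code:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical reconstruction: a stale value of 0 from old config code ends up in the job's configuration.
val staleParallelism = 0 // assumed leftover from old configuration code
val badConf = new SparkConf()
  .setAppName("binary-file-job")
  .set("spark.default.parallelism", staleParallelism.toString) // shuffles now produce 0 partitions

// The fix is simply to set it to a sensible positive value (or omit it and let Spark pick a default).
val goodConf = new SparkConf()
  .setAppName("binary-file-job")
  .set("spark.default.parallelism", "8") // assumed value; choose something suited to the cluster

val sc = new SparkContext(goodConf)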
From the Spark documentation on spark.default.parallelism:
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
So while an operation like collect() worked perfectly fine, any attempt to reshuffle the data without specifying the number of partitions gave me an empty RDD. That'll teach me to trust old configuration code.
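As an aside not drawn from the original answer: the shuffle operators also accept an explicit partition count, which overrides spark.default.parallelism for that operation and would have sidestepped the bad default. A hedged sketch, reusing the my_rdd from the question and an assumed partition count of 8:

// Passing numPartitions explicitly overrides spark.default.parallelism for this shuffle.
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB, 8)
reducedResults.count() // counts one entry per distinct filename, even with a bad default setting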
Upvotes: 1