Reputation: 1014
I have a huge RDD whose individual partitions I want to sort locally. I looked into the sortByKey operation, but it is not clear whether it triggers a shuffle, which I want to avoid.
A Cloudera blog post says that sortByKey involves a shuffle, but from the Javadoc of sortByKey it looks like there is no shuffle until collect() is invoked.
Question: Does sortByKey() involve shuffling data? If yes, what would be the best way to sort the data within each RDD partition? If no, how does collect() make everything globally sorted?
Upvotes: 0
Views: 2222
Reputation: 5700
Basically, sortByKey() is a wide transformation. Since all transformations are lazy, the shuffle happens only when you trigger an action (in your case, collect()). In general, transformations are like instructions for an operation, and an action uses these instructions when it executes. You can also look at the DAG for a clearer picture.
Upvotes: 0
Reputation: 33242
It involves a shuffle, but of course this happens only when an action, such as collect or take, is involved in your execution graph. This is because when the result of the sort has to be consumed by other transformations, records with the same key have to be directed to the same consumer on the cluster.
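To make the difference concrete, here is a rough sketch in plain Python (not Spark code) that models partitions as lists. All function names are hypothetical and only illustrate the idea: a partition-local sort moves no records between partitions, while a global sort first routes keys into ordered ranges (the shuffle) and then sorts each range locally, which is why concatenating the partitions in order, as collect() does, yields a fully sorted result. As far as I know, sortByKey picks its range boundaries by sampling; the sketch simply assumes it can look at all keys. In Spark itself, the partition-local variant would probably be something like mapPartitions with an in-memory sort, assuming each partition fits in memory.

```python
from bisect import bisect_right


def sort_within_partitions(partitions):
    """Sort each partition independently; no record changes partition."""
    return [sorted(p) for p in partitions]


def range_shuffle_sort(partitions, num_out):
    """Model of a global sort: route keys into ordered ranges, then sort locally."""
    all_keys = sorted(k for p in partitions for k in p)
    if not all_keys:
        return [[] for _ in range(num_out)]
    # Assumption: boundaries are computed from every key; a real system would sample.
    bounds = [all_keys[(i * len(all_keys)) // num_out] for i in range(1, num_out)]
    out = [[] for _ in range(num_out)]
    for p in partitions:
        for k in p:
            # The "shuffle": each key is sent to the partition owning its range.
            out[bisect_right(bounds, k)].append(k)
    return [sorted(p) for p in out]


def collect(partitions):
    """Concatenate partitions in partition order."""
    return [k for p in partitions for k in p]


parts = [[9, 2, 7], [4, 8, 1], [6, 3, 5]]
local = sort_within_partitions(parts)
globally_sorted = range_shuffle_sort(parts, 3)

print(local)                      # each partition sorted, but no global order
print(collect(globally_sorted))   # globally ordered after the range shuffle
```

Note that the local variant keeps every record where it was, so it cannot produce a global order on its own; that only works once the keys have been redistributed by range.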
Upvotes: 3