Reputation: 2092
I have two dataframes with a large (millions to tens of millions) number of rows. I'd like to do a join between them.
In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.
Is this a pattern that I need to follow in Spark, or does it not matter? At first glance it seems like a lot of time would be wasted shuffling data between partitions because the data hasn't been pre-partitioned on the join key.
If it is necessary, then how do I do that?
Upvotes: 1
Views: 1534
Reputation: 26
If it is necessary, then how do I do that?
How to define partitioning of DataFrame?
However, it makes sense only under two conditions:
Upvotes: 1