Daniel

Reputation: 2092

join DataFrames within partitions in PySpark

I have two DataFrames, each with a large number of rows (millions to tens of millions). I'd like to do a join between them.

In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.

Is this a pattern that I need to follow in Spark as well, or does it not matter? At first glance it seems like a lot of time would be wasted shuffling data between partitions if it hasn't been pre-partitioned correctly.

If it is necessary, then how do I do that?

Upvotes: 1

Views: 1534

Answers (1)

user9142694

Reputation: 26

"If it is necessary, then how do I do that?"

See How to define partitioning of DataFrame? for the mechanics.

However, it makes sense only under two conditions:

  • There are multiple joins within the same application. Repartitioning is itself a shuffle, so with a single join there is no added value (a sketch follows this list).
  • It is a long-lived application where the shuffled data will be reused. Spark cannot take advantage of the partitioning of data stored in an external format.
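As a minimal sketch of what that can look like, assuming two hypothetical DataFrames read from Parquet, a join key named id, and an illustrative partition count of 200 (all of these names and numbers are placeholders, not from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-join").getOrCreate()

    # Hypothetical inputs; substitute your own sources.
    df1 = spark.read.parquet("/path/to/first")
    df2 = spark.read.parquet("/path/to/second")

    # Repartition both sides by the join key so matching rows land in
    # co-located partitions. Note that this is itself a full shuffle.
    df1_p = df1.repartition(200, "id").cache()
    df2_p = df2.repartition(200, "id").cache()

    # The first join pays the shuffle cost incurred above; count() is
    # an action that materializes the cached, co-partitioned data.
    first = df1_p.join(df2_p, "id")
    first.count()

    # Later operations that need the same distribution, such as another
    # join on "id", can reuse the cached layout without a new shuffle.
    # ("amount" is a hypothetical column used only for illustration.)
    second = df1_p.join(df2_p.where(df2_p["amount"] > 0), "id")

Because the repartition is itself a full shuffle, the first join costs roughly as much as a plain join would; the cached co-partitioned data only pays off once it is reused across several joins in the same application.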

Upvotes: 1
