user1124702

Reputation: 1135

Alternatives to join on Spark dataframe

I have two Spark dataframes, each with 539 million rows and 4 columns. Columns A and B are strings; columns C and D (in DF1) and E and F (in DF2) are floats.

DF1
-----------------
A      B     C      D
"A1"  "B1"  1.1    1.2
"A2"  "B2"  1.3    1.4

DF2
-----------------
A      B     E      F
"A1"  "B1"  2.1    2.2
"A2"  "B2"  2.3    2.4

I would like to join DF1 (539 million rows) and DF2 (also 539 million rows). I tried DF1.join(DF2, Seq("A", "B"), "fullouter") on a 50-node cluster with 8 GB of executor memory. It terminates the cluster with an out-of-memory error.

Are there alternatives that are more memory-efficient than df.join(), for example joining with RDDs or Datasets?
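For concreteness, a minimal Scala sketch of what I am running (the parquet paths are placeholders; the A/B join keys are taken from the sample data above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    val DF1 = spark.read.parquet("/path/to/df1")   // columns A, B, C, D
    val DF2 = spark.read.parquet("/path/to/df2")   // columns A, B, E, F

    // Full outer join on the shared key columns; this is the step
    // that dies with the out-of-memory error.
    val joined = DF1.join(DF2, Seq("A", "B"), "fullouter")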

Upvotes: 0

Views: 991

Answers (2)

Nitin Kumar

Reputation: 249

You also need to check the following (a sketch follows the list):

  1. How you have partitioned your data.
  2. How many executors you have assigned, based on the number of partitions.
  3. As mentioned in the other answer: the driver memory.
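A minimal Scala sketch of those checks (the partition count of 2000 and the repartition-on-key step are illustrative assumptions, not values from the question):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().getOrCreate()
    val df1 = spark.read.parquet("/path/to/df1")   // stands in for DF1
    val df2 = spark.read.parquet("/path/to/df2")   // stands in for DF2

    // 1. Inspect the current partitioning of both sides.
    println(s"df1 partitions: ${df1.rdd.getNumPartitions}")
    println(s"df2 partitions: ${df2.rdd.getNumPartitions}")

    // 2. Repartition both sides on the join keys so matching rows land in
    //    the same partition; tune 2000 to your data volume and executor count.
    val joined = df1.repartition(2000, col("A"), col("B"))
      .join(df2.repartition(2000, col("A"), col("B")), Seq("A", "B"), "fullouter")

    // 3. Executor count and driver memory are set at submit time, e.g.
    //    spark-submit --num-executors 50 --executor-memory 8G --driver-memory 8G ...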

Upvotes: 0

Hitesh Shahani

Reputation: 69

Please check the following (a sketch follows the list):

  1. Compression has been used.
  2. The join condition is present.
  3. Check your driver program's heap memory in the Spark UI and adjust it.
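A rough Scala sketch of these points (the snappy codec and the A/B join keys are assumptions based on the question's schema):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df1 = spark.read.parquet("/path/to/df1")   // A, B, C, D
    val df2 = spark.read.parquet("/path/to/df2")   // A, B, E, F

    // 1. Write output with a compressed codec.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    // 2. Join with an explicit condition; joining without one amounts to a
    //    cartesian product, which will never fit in memory at this scale.
    val joined = df1.join(df2, Seq("A", "B"), "fullouter")

    // 3. Driver heap usage shows up on the Executors tab of the Spark UI and
    //    is changed at launch time, e.g. spark-submit --driver-memory 8G ...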

Upvotes: 1
