Reputation: 1135
I have two Spark DataFrames, each with 539 million rows and 4 columns. Columns A and B are strings; columns C, D, E, and F are floats.
DF1
-----------------
A B C D
"A1" "B1" 1.1 1.2
"A2" "B2" 1.3 1.4
DF2
-----------------
A B E F
"A1" "B1" 2.1 2.2
"A2" "B2" 2.3 2.4
I would like to join DF1 (539 million rows) and DF2 (also 539 million rows) on columns A and B. I tried DF1.join(DF2, ["A", "B"], "fullouter") on a 50-node cluster with 8 GB of executor memory. The job terminates the cluster with an out-of-memory error.
Are there alternatives, such as joining with RDDs or Datasets, that are more memory-efficient than df.join()?
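For reference, here is roughly what I ran (a sketch; the parquet paths are placeholders for where the data actually lives):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-join").getOrCreate()

# Placeholder paths -- each DataFrame is 539 million rows.
DF1 = spark.read.parquet("/path/to/df1")   # columns: A, B, C, D
DF2 = spark.read.parquet("/path/to/df2")   # columns: A, B, E, F

# Full outer join on the shared key columns A and B.
joined = DF1.join(DF2, ["A", "B"], "fullouter")
joined.write.parquet("/path/to/output")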
Upvotes: 0
Views: 991
Reputation: 249
You also need to check the following:
1) How you have partitioned your data
2) How many executors you have assigned relative to the number of partitions
3) As mentioned above: the driver memory
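For example (a rough sketch; the partition count and memory values below are illustrative, not recommendations for your specific data):

# Submit-time settings (illustrative values only):
# spark-submit \
#   --num-executors 50 \
#   --executor-memory 8G \
#   --executor-cores 4 \
#   --driver-memory 8G \
#   --conf spark.sql.shuffle.partitions=2000 \
#   your_job.py

# Repartition both sides on the join keys so the shuffle spreads
# the rows across many partitions instead of a few huge ones.
df1 = DF1.repartition(2000, "A", "B")
df2 = DF2.repartition(2000, "A", "B")
joined = df1.join(df2, ["A", "B"], "fullouter")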
Upvotes: 0
Reputation: 69
Please check the following
Upvotes: 1