Ashis Parajuli

Reputation: 145

OR condition in a join results in a cross join

I am trying to join two datasets in Spark (version 2.1) with the following query:

SELECT * 
  FROM Tb1 
       INNER JOIN Tb2 
          ON Tb1.key1=Tb2.key1 
            OR Tb1.key2=Tb2.Key2

But this results in a cross join. How can I join the two tables and get only the matching records?

I have also tried a left outer join, but Spark still forces me to change it to a cross join.

Upvotes: 1

Views: 1136

Answers (2)

pauli

Reputation: 4301

Try this method, which passes the OR condition to the DataFrame join directly:

from pyspark.sql import SQLContext as SQC

# `sc` is the SparkContext that the pyspark shell creates for you
sqc = SQC(sc)
x = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]
y = [(1,4,5), (4,5,6), (10,11,16),(34,23,31), (56,14,89)]
x_df = sqc.createDataFrame(x,["x","y","z"])
y_df = sqc.createDataFrame(y,["x","y","z"])

cond = [(x_df.x == y_df.x) | ( x_df.y == y_df.y)]

x_df.join(y_df,cond, "inner").show()

Output:

+---+---+---+---+---+---+
|  x|  y|  z|  x|  y|  z|
+---+---+---+---+---+---+
|  1|  2|  3|  1|  4|  5|
|  4|  5|  6|  4|  5|  6|
| 10| 11| 12| 10| 11| 16|
| 13| 14| 15| 56| 14| 89|
+---+---+---+---+---+---+

Upvotes: 2

Jorge Campos

Reputation: 23381

By joining it twice:

 select * 
   from Tb1 
      inner join Tb2 
         on Tb1.key1=Tb2.key1 
      inner join Tb2 as Tb22 
         on Tb1.key2=Tb22.Key2  

Or by left joining both:

 select * 
   from Tb1 
      left join Tb2 
         on Tb1.key1=Tb2.key1 
      left join Tb2 as Tb22 
         on Tb1.key2=Tb22.Key2  
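
Note that the two queries above are not quite the same as the OR join in the question: the double inner join keeps only Tb1 rows that find a match on both keys, and the double left join keeps every Tb1 row, matched or not. A common workaround that does reproduce the OR semantics is to take the UNION of two equi-joins, one per key. The sketch below is only an illustration: it uses Python's built-in sqlite3 so it runs without a cluster, the id columns and the sample rows are made up, and I have not checked the plan Spark 2.1 produces for it. The idea is that each branch is a plain equi-join, which Spark can usually plan without a Cartesian product.

```python
import sqlite3

# In-memory database with hypothetical sample data; only the table and
# key column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tb1 (id INTEGER, key1 TEXT, key2 TEXT);
    CREATE TABLE Tb2 (id INTEGER, key1 TEXT, key2 TEXT);
    INSERT INTO Tb1 VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
    INSERT INTO Tb2 VALUES (10, 'a', 'q'), (20, 'k', 'y'),
                           (30, 'b', 'y'), (40, 'm', 'n');
""")

# The original query: a single join with an OR condition.
or_join = """
    SELECT Tb1.id AS id1, Tb2.id AS id2
      FROM Tb1
           INNER JOIN Tb2
              ON Tb1.key1 = Tb2.key1
                OR Tb1.key2 = Tb2.key2
"""

# The rewrite: one equi-join per key, combined with UNION so that a pair
# matching on both keys is returned only once.
union_join = """
    SELECT Tb1.id AS id1, Tb2.id AS id2
      FROM Tb1 INNER JOIN Tb2 ON Tb1.key1 = Tb2.key1
    UNION
    SELECT Tb1.id AS id1, Tb2.id AS id2
      FROM Tb1 INNER JOIN Tb2 ON Tb1.key2 = Tb2.key2
"""

or_rows = sorted(conn.execute(or_join).fetchall())
union_rows = sorted(conn.execute(union_join).fetchall())

print(or_rows)     # [(1, 10), (2, 20), (2, 30)]
print(union_rows)  # [(1, 10), (2, 20), (2, 30)]
```

One caveat: UNION removes all duplicate rows, so if either table legitimately contains identical rows, the result can have fewer rows than the OR join would return.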

Upvotes: 1
