Reputation: 65
I have several tables, and I would like to filter the rows in one of them, based on whether two columns are present in another table. The data on each table is as follows
Table1 : one hash can be associated to several articles ; one article can be associated to several hashes
User Hash | Article Name |
---|---|
Hash1 | Article1 |
Hash1 | Article2 |
Hash2 | Article1 |
Hash3 | Article3 |
Table2 : each User Hash is uniquely associated to a User ID
User Hash | User ID |
---|---|
Hash1 | ID1 |
Hash2 | ID2 |
Hash3 | ID3 |
Table3 : each Article Name is uniquely associated to an Article Number
Article Name | Article number |
---|---|
Article1 | Number1 |
Article2 | Number2 |
Article3 | Number3 |
Table4
User ID | Article Number | OtherField |
---|---|---|
ID1 | Number1 | Misc1 |
ID2 | Number2 | Misc2 |
ID3 | Number3 | Misc3 |
I would like to keep in Table4, only the lines for which the combination (User ID, Article Number) is present in Table1 (as User Hash and Article Name). So in this example, I would like to get the following result :
User ID | Article Number | OtherField |
---|---|---|
ID1 | Number1 | Misc1 |
ID3 | Number3 | Misc3 |
What is the best way to do this in Spark SQL ? I have experimented using JOIN, but I am struggling with the fact that there are two conditions, which I would like both to be valid in a single row.
In my example, ID2 & Number2 are both in Table1, but not on the same row, so I want to filter this row out of Table4.
I hope the question is clear enough. Thanks in advance !
Upvotes: 0
Views: 1898
Reputation: 42352
You can do a semi join:
select *
from table4
left semi join (
select * from table1
join table2 using (`User Hash`)
join table3 using (`Article Name`)
) using (`User ID`, `Article Number`)
+-------+--------------+----------+
|User ID|Article Number|OtherField|
+-------+--------------+----------+
| ID1| Number1| Misc1|
| ID3| Number3| Misc3|
+-------+--------------+----------+
Or equivalently, where exists
:
select *
from table4
where exists (
select * from table1
join table2 using (`User Hash`)
join table3 using (`Article Name`)
where `User ID` = table4.`User ID`
and `Article Number` = table4.`Article Number`
)
Upvotes: 1
Reputation: 1270081
If I understand correctly, this is just a bunch of joins:
select t2.userid, t3.articlenumber
from table1 t1 join
table2 t2
on t1.userhash = t2.userhash join
table3 t3
on t3.articlename = t1.articlename;
Upvotes: 0