Reputation: 110
I have 5 tables like below: Table A
rank. | input. |
---|---|
0. | aa |
1. | bb |
2 | cc |
3 | dd |
Table B
rank. | input. |
---|---|
0. | aa |
3 | cc |
4 | dd |
5 | ee |
Table C
rank. | input. |
---|---|
0. | aa |
5 | ee |
6 | ff |
7 | gg |
Table D
rank. | input. |
---|---|
0. | aa |
2 | bb |
6 | ff |
7 | gg |
I need the output to be like below:
Final table
rank. | input. |
---|---|
0. | aa |
2 | bb |
3 | cc |
5 | ee |
6 | ff |
7 | gg |
If i just cross join all the tables depending on the biggest table, i get the below output :
rank. | input. |
---|---|
0. | aa |
Is there a way to get the output i want without having to do multiple joins across AB,BC,CD,BD etc..
Please let me know. I can either use SQL or Pyspark to do this. Any suggestions would be appreciated.
Upvotes: 0
Views: 43
Reputation: 42412
You can union all the tables, group by input and get the maximum of the rank:
select max(`rank`) as `rank`, input
from (
select * from tableA
union all
select * from tableB
union all
select * from tableC
union all
select * from tableD
) t
group by input
In Pyspark it would be
from functools import reduce
df = reduce(lambda a, b: a.unionAll(b), [tableA,tableB,tableC,tableD])
result = df.groupBy('input').agg(F.max('rank').alias('rank'))
Upvotes: 2