Megha_991
Megha_991

Reputation: 110

Joining multiple tables multiple ways

I have 5 tables like below: Table A

rank. input.
0. aa
1. bb
2 cc
3 dd ​

Table B

rank. input.
0. aa
3 cc
4 dd
5 ee ​

Table C

rank. input.
0. aa
5 ee
6 ff
7 gg ​

Table D

rank. input.
0. aa
2 bb
6 ff
7 gg ​

I need the output to be like below:

Final table

rank. input.
0. aa
2 bb
3 cc
5 ee ​
6 ff
7 gg ​

If i just cross join all the tables depending on the biggest table, i get the below output :

rank. input.
0. aa

Is there a way to get the output i want without having to do multiple joins across AB,BC,CD,BD etc..

Please let me know. I can either use SQL or Pyspark to do this. Any suggestions would be appreciated.

Upvotes: 0

Views: 43

Answers (1)

mck
mck

Reputation: 42412

You can union all the tables, group by input and get the maximum of the rank:

select max(`rank`) as `rank`, input
from (
    select * from tableA
    union all
    select * from tableB
    union all
    select * from tableC
    union all
    select * from tableD
) t
group by input

In Pyspark it would be

from functools import reduce

df = reduce(lambda a, b: a.unionAll(b), [tableA,tableB,tableC,tableD])
result = df.groupBy('input').agg(F.max('rank').alias('rank'))

Upvotes: 2

Related Questions