Joining multiple tables multiple ways

Question

I have 5 tables like below: Table A

rank.	input.
0.	aa
1.	bb
2	cc
3	dd

Table B

rank.	input.
0.	aa
3	cc
4	dd
5	ee

Table C

rank.	input.
0.	aa
5	ee
6	ff
7	gg

Table D

rank.	input.
0.	aa
2	bb
6	ff
7	gg

I need the output to be like below:

Final table

rank.	input.
0.	aa
2	bb
3	cc
5	ee
6	ff
7	gg

If i just cross join all the tables depending on the biggest table, i get the below output :

rank.	input.
0.	aa

Is there a way to get the output i want without having to do multiple joins across AB,BC,CD,BD etc..

Please let me know. I can either use SQL or Pyspark to do this. Any suggestions would be appreciated.

mck · Accepted Answer

You can union all the tables, group by input and get the maximum of the rank:

select max(`rank`) as `rank`, input
from (
    select * from tableA
    union all
    select * from tableB
    union all
    select * from tableC
    union all
    select * from tableD
) t
group by input

In Pyspark it would be

from functools import reduce

df = reduce(lambda a, b: a.unionAll(b), [tableA,tableB,tableC,tableD])
result = df.groupBy('input').agg(F.max('rank').alias('rank'))

Joining multiple tables multiple ways

Answers (1)

Related Questions