Reputation: 371
I have two tables I want to join:
Table X:
| country | city   | user  |
|---------|--------|-------|
| USA     | Boston | David |
| USA     | Miami  | John  |
| France  | Paris  | Peter |
Table Y:
| country | detail | value  | id |
|---------|--------|--------|----|
| USA     | city   | Boston | 1  |
| USA     | null   | null   | 2  |
| France  | null   | null   | 3  |
And this is the output I want:
| country | id | city   | user  |
|---------|----|--------|-------|
| USA     | 1  | Boston | David |
| USA     | 2  | null   | David |
| USA     | 2  | null   | John  |
| France  | 3  | null   | Peter |
The way I get this in SQL is:
select country, id, city, user
from X
join Y
  on x.country = y.country
 and if(y.detail = 'city', x.city = y.value, TRUE)
How can I get this in PySpark?
Upvotes: 0
Views: 428
Reputation: 3676
You can do so with the code below; note that I had to select y.value
and alias it to city in order to reproduce your example output.
from pyspark.sql import functions as F

d1 = [
    ('USA', 'Boston', 'David'),
    ('USA', 'Miami', 'John'),
    ('France', 'Paris', 'Peter')
]
d2 = [
    ('USA', 'city', 'Boston', 1),
    ('USA', None, None, 2),
    ('France', None, None, 3)
]

x = spark.createDataFrame(d1, ['country', 'city', 'user'])
y = spark.createDataFrame(d2, ['country', 'detail', 'value', 'id'])

# Always match on country; when y.detail is 'city', additionally require
# the city to match, otherwise accept every x row for that country.
cond = (x.country == y.country) & (
    F.when(y.detail == 'city', x.city == y.value).otherwise(F.lit(True))
)

(x.join(y, on=cond)
    .select(x.country, y.id, y.value.alias('city'), x.user)
    .orderBy('id')
    .show())
+-------+---+------+-----+
|country| id| city| user|
+-------+---+------+-----+
| USA| 1|Boston|David|
| USA| 2| null|David|
| USA| 2| null| John|
| France| 3| null|Peter|
+-------+---+------+-----+
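If you prefer to keep your original SQL, here is a sketch of an alternative (assuming the same x, y, and spark session from above): register the DataFrames as temporary views and run the query through spark.sql, with the select list disambiguated and y.value aliased to city to match your expected output.

# Alternative sketch: reuse the original SQL via temporary views.
x.createOrReplaceTempView('x')
y.createOrReplaceTempView('y')

spark.sql("""
    SELECT x.country, y.id, y.value AS city, x.user
    FROM x
    JOIN y
      ON x.country = y.country
     AND IF(y.detail = 'city', x.city = y.value, TRUE)
    ORDER BY id
""").show()

This produces the same output as the DataFrame version above.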
Upvotes: 1