Reputation: 121
I would like to do the following in pyspark (for AWS Glue jobs):
JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city
So for example:
Table a:

| Number | Name | City   |
|--------|------|--------|
| 1000   | Bob  | %      |
| 2000   | Joe  | London |

Table b:

| Number | Name | City   |
|--------|------|--------|
| 1000   | Bob  | Boston |
| 1000   | Bob  | Berlin |
| 2000   | Joe  | Paris  |

Results:

| Number | Name | City   |
|--------|------|--------|
| 1000   | Bob  | Boston |
| 1000   | Bob  | Berlin |
The part I don't know how to do is implementing the wildcard "%" with the LIKE operator. I know you can use .like() with a string literal, for example:

df.where(col('col1').like("%string%")).show()

But it expects a string, whereas in my case I want to match against another column. Something like the following:
result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- This doesn't work since it expects a string, not a column
)
Any help with this would be much appreciated!
Upvotes: 6
Views: 8743
Reputation: 42352
Try using an expression:
import pyspark.sql.functions as F

result = a.alias('a').join(
    b.alias('b'),
    (a.name == b.name) &
    (a.number == b.number) &
    F.expr("b.city like a.city")
)
I think you meant to do b like a rather than a like b, because the % wildcard is in table a.
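
For reference, here's a minimal, self-contained sketch of this approach using the sample data from the question; the SparkSession setup and the lowercase column names are assumptions for the demo, not part of the original code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the tables in the question
a = spark.createDataFrame(
    [(1000, "Bob", "%"), (2000, "Joe", "London")],
    ["number", "name", "city"],
)
b = spark.createDataFrame(
    [(1000, "Bob", "Boston"), (1000, "Bob", "Berlin"), (2000, "Joe", "Paris")],
    ["number", "name", "city"],
)

# F.expr parses a SQL fragment, so LIKE can compare two columns,
# which Column.like (string pattern only) cannot do
result = a.alias("a").join(
    b.alias("b"),
    (F.col("a.name") == F.col("b.name"))
    & (F.col("a.number") == F.col("b.number"))
    & F.expr("b.city LIKE a.city"),
).select("b.number", "b.name", "b.city")

result.show()
# +------+----+------+
# |number|name|  city|
# +------+----+------+
# |  1000| Bob|Boston|
# |  1000| Bob|Berlin|
# +------+----+------+

Note that the row (2000, Joe, London) matches nothing, since 'Paris' LIKE 'London' is false, which reproduces the expected output above.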
Upvotes: 6