Reputation: 121
I would like to do the following in pyspark (for AWS Glue jobs):
JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city
So for example:
Table a:

| Number | Name | City   |
|--------|------|--------|
| 1000   | Bob  | %      |
| 2000   | Joe  | London |

Table b:

| Number | Name | City   |
|--------|------|--------|
| 1000   | Bob  | Boston |
| 1000   | Bob  | Berlin |
| 2000   | Joe  | Paris  |

Results:

| Number | Name | City   |
|--------|------|--------|
| 1000   | Bob  | Boston |
| 1000   | Bob  | Berlin |
The part I don't know how to do is implementing the wildcard "%" with the LIKE operator. I know you can use .like() with a string literal, for example:

df.where(col('col1').like("%string%")).show()

But it expects a string, whereas in my case I want to match against another column. Something like the following:
result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- This doesn't work since it expects a string, not a column
)
Any help with this would be much appreciated!
Upvotes: 6
Views: 8743
Reputation: 42352
Try using an expression:
import pyspark.sql.functions as F

result = a.alias('a').join(
    b.alias('b'),
    (a.name == b.name) &
    (a.number == b.number) &
    F.expr("b.city like a.city")
)
I think you meant to do b like a rather than a like b, because the % wildcard is in table a.
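
For reference, here's a minimal, self-contained sketch of this approach using the sample data from the question; the SparkSession setup and the lowercase column names are assumptions for the demo, not part of the original code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the tables in the question
a = spark.createDataFrame(
    [(1000, "Bob", "%"), (2000, "Joe", "London")],
    ["number", "name", "city"],
)
b = spark.createDataFrame(
    [(1000, "Bob", "Boston"), (1000, "Bob", "Berlin"), (2000, "Joe", "Paris")],
    ["number", "name", "city"],
)

# F.expr parses a SQL fragment, so LIKE can compare two columns,
# which Column.like (string pattern only) cannot do
result = a.alias("a").join(
    b.alias("b"),
    (F.col("a.name") == F.col("b.name"))
    & (F.col("a.number") == F.col("b.number"))
    & F.expr("b.city LIKE a.city"),
).select("b.number", "b.name", "b.city")

result.show()
# +------+----+------+
# |number|name|  city|
# +------+----+------+
# |  1000| Bob|Boston|
# |  1000| Bob|Berlin|
# +------+----+------+

Note that the row (2000, Joe, London) matches nothing, since 'Paris' LIKE 'London' is false, which reproduces the expected output above.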
Upvotes: 6