skolukmar

Reputation: 331

PySpark join dataframes with LIKE

I am trying to join two dataframes using a LIKE expression in which the condition (the content of the LIKE) is stored in a column. Is this possible in PySpark 2.3?

Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
|    James|          |
|  Michael|      Rose|
|   Robert|  Williams|
|    Maria|      Anne|
+---------+----------+
 
Second dataframe
+---------+----+
|condition|dest|
+---------+----+
|      %a%|Box1|
|      %b%|Box2|
+---------+----+

Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+

Let me reproduce the issue on a small sample. First, create the source dataframe:

from pyspark.sql.types import StructType, StructField, StringType

data = [("James", ""),
        ("Michael", "Rose"),
        ("Robert", "Williams"),
        ("Maria", "Anne")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show()
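(The snippets assume a spark session is already available, as in the pyspark shell or a notebook. If you run them as a standalone script, here is a minimal sketch of how one could be created first:)

from pyspark.sql import SparkSession

# Entry point for DataFrame operations when not running in the pyspark shell;
# the app name is arbitrary.
spark = SparkSession.builder.appName("like-join-example").getOrCreate()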

and the second one:

mapping = [("%a%", "Box1"), ("%b%", "Box2")]

schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True)
])

map = spark.createDataFrame(data=mapping, schema=schema)
map.show()

If I am right, it is not possible to use LIKE directly in a dataframe join, so I created a crossJoin and tried to use a filter with like. But is it possible to take the pattern from a column, rather than a fixed string? The following is invalid syntax, of course, but it shows what I am looking for:

df.crossJoin(map).filter(df.firstname.like(map.condition)).show()

Upvotes: 0

Views: 656

Answers (1)

Kombajn zbożowy

Reputation: 10693

Any expression can be used as a join condition. True, with the DataFrame API the like function's parameter can only be a str, not a Column, so you can't write col("firstname").like(col("condition")). However, the SQL version does not have this limitation, so you can leverage expr:

from pyspark.sql.functions import expr

df.join(map, expr("firstname like condition")).show()

Or just plain SQL:

df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()

Both return the same result:

+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+
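For completeness: the crossJoin + filter idea from the question also works once the comparison is written as a SQL expression instead of a Python-side like call. A minimal sketch, using the same df and map as above:

from pyspark.sql.functions import expr

# Cross join, then keep only the rows where firstname matches the pattern
# stored in that row's condition column.
df.crossJoin(map).filter(expr("firstname like condition")).show()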

Upvotes: 2
