Reputation: 331
I am trying to join two dataframes using a LIKE expression in which the condition (the argument of LIKE) is stored in a column. Is this possible in PySpark 2.3?
Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
|    James|          |
|  Michael|      Rose|
|   Robert|  Williams|
|    Maria|      Anne|
+---------+----------+
Second dataframe:
+---------+----+
|condition|dest|
+---------+----+
|      %a%|Box1|
|      %b%|Box2|
+---------+----+
Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+
Let me reproduce the issue on a sample. First, create the source dataframe:
from pyspark.sql.types import StructType, StructField, StringType

data = [("James", ""),
        ("Michael", "Rose"),
        ("Robert", "Williams"),
        ("Maria", "Anne")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show()
and the second one:
mapping = [("%a%", "Box1"), ("%b%", "Box2")]

schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True)
])

map = spark.createDataFrame(data=mapping, schema=schema)
map.show()
If I am right, it is not possible to use LIKE directly in a join condition, so I created a crossJoin and tried to filter with like. But can the pattern be taken from a column instead of a fixed string? The following is invalid syntax, of course, but it shows what I am looking for:
df.crossJoin(map).filter(df.firstname.like(map.condition)).show()
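To make the intent concrete, here is what I want to achieve, emulated in plain Python (no Spark needed). The hypothetical sql_like helper below treats % as "any run of characters" and _ as "any single character", which is my understanding of SQL LIKE semantics:

```python
import re

def sql_like(value, pattern):
    # Translate a SQL LIKE pattern into a regex: % -> .*, _ -> . ,
    # every other character matched literally (case-sensitive, as in Spark).
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value) is not None

names = ["James", "Michael", "Robert", "Maria"]
conditions = [("%a%", "Box1"), ("%b%", "Box2")]

# Emulate crossJoin + filter: keep every (name, pattern) pair that matches.
result = [(name, cond, dest)
          for name in names
          for cond, dest in conditions
          if sql_like(name, cond)]
```

This produces exactly the four rows of the expected result above; the question is how to express the same thing on dataframes.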
Upvotes: 0
Views: 656
Reputation: 10693
Any expression can be used as a join condition. It is true that with the DataFrame API the parameter of the like function can only be a str, not a Column, so you cannot write col("firstname").like(col("condition")). The SQL parser has no such limitation, however, so you can leverage expr:
from pyspark.sql.functions import expr

df.join(map, expr("firstname like condition")).show()
Or just plain SQL:
df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()
Both return the same result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+
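One caveat worth keeping in mind: since several patterns can match the same name, a LIKE-based join can emit more than one row per input row. A minimal pure-Python sketch (the name "Barbara" is a hypothetical example not in your data, and the substring test only approximates LIKE for these simple %x% patterns):

```python
patterns = [("%a%", "Box1"), ("%b%", "Box2")]
name = "Barbara"  # hypothetical: contains both 'a' and 'b'

# For plain "%x%" patterns, LIKE reduces to a substring test.
rows = [(name, cond, dest)
        for cond, dest in patterns
        if cond.strip("%") in name]
```

Here rows holds two entries, one per matching pattern, so the joined dataframe would carry "Barbara" twice.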
Upvotes: 2