check first dataframe value startswith any of the second dataframe value

Question

I have two pyspark dataframe as follow :

df1 = spark.createDataFrame(
    ["yes","no","yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

df2 = spark.createDataFrame(
    ["yes","no"],
    "string"
).toDF("location")

i want to check if values in location col from df1, startsWith, values in location col of df2 and vice versa.

Something like :

df1.select("location").startsWith(df2.location)

Following is the output i am expecting here:

+-------------+
|     location|
+-------------+
|          yes|
|           no|
|        yes23|
|          no3|
+-------------+

mck · Accepted Answer

Using spark SQL looks the easiest to me:

df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
joined = spark.sql("""
    select df1.*
    from df1
    join df2
    on df1.location rlike '^' || df2.location
""")

check first dataframe value startswith any of the second dataframe value

Answers (1)

Related Questions