jota_ele_a

Reputation: 11

PySpark: join using isin to find if a column in one dataframe is substring of another column of another dataframe

I searched for an existing question about this in PySpark, but had no success.

I have a DataFrame of messy names, called df1 (as indicated in the image) and I prepared a DataFrame of clean names, called df2 (see the image). How can I use .join() and .isin() or anything else to obtain the last table that is in the attached image?

Here is the image: 1

I have tried

cond = [df2["Clean_names"].isin(df1["Names"])]

df1 = df1.join(df2, cond, "left")

but the result was an error saying that .join() expects something else as its arguments. I'm sorry, I don't have the exact error log anymore. The real DataFrames are quite big, so I can't use any iterative operations (e.g. for loops, pandas with .loc, or pandas at all).

Also, I just created an account on Stack Overflow, so I'm sorry I couldn't format my question better.

Upvotes: 1

Views: 263

Answers (0)
