ochristm

Reputation: 21

PySpark: join two DataFrames on words contained in a text column

I need to join two DataFrames, df_text (with a text column) and df_words (with word and rate columns), and sum the rate.


I've tried:

df_joined = df_text.join(df_words, f.expr("text rlike word"), 'left')

or

df_joined = df_text.join(df_words, on=df_text.text.contains(df_words.word), how='left')


But both versions also match parts of words. For example, df_words contains "slow" and "slowly"; if "slowly" is in the text, both rows join and both rates are counted, but I need only the one for "slowly".

Any suggestions? Thanks.
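The over-matching described above can be reproduced outside Spark. The following is a minimal plain-Python sketch, not PySpark code; the words, rates, and sample sentence are made up for illustration. It contrasts a substring test (which is what contains and an unanchored rlike perform) with a whole-word test using \b word boundaries.

```python
import re

# Hypothetical sample data standing in for df_words (word -> rate)
rates = {"slow": 1.0, "slowly": 2.0}
text = "he walks slowly"

# Substring test: "slow" is found inside "slowly", so both words match
substring_hits = sorted(w for w in rates if w in text)

# Whole-word test: the escaped word must sit between \b word boundaries
whole_word_hits = sorted(
    w for w in rates if re.search(r"\b" + re.escape(w) + r"\b", text)
)

print(substring_hits)   # ['slow', 'slowly']
print(whole_word_hits)  # ['slowly']
```

The same word-boundary idea could in principle be applied to the rlike join condition by wrapping word in \b anchors, assuming the words contain no regex metacharacters; that variant is not tested here.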

Upvotes: 1

Views: 97

Answers (1)

ochristm

Reputation: 21

This seems to work fine:

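import pyspark.sql.functions as f

# Split each text into space-separated tokens
split_col = f.split(df_text['text'], ' ')
df_text = df_text.withColumn('txt_split', split_col)

# One row per token, then an exact-match left join on word
df_join = df_text.withColumn('word', f.explode('txt_split'))\
    .join(df_words, 'word', 'left')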

Upvotes: 1
