dnf999

Reputation: 33

Pyspark when statement

Hi, I'm starting to use PySpark and want to add a column using a when/otherwise condition:

df_1 = df.withColumn("test", when(df.first_name == df2.firstname & df.last_namne == df2.lastname, "1. Match on First and Last Name").otherwise ("No Match"))

I get the error below and would like some help understanding why the statement above does not work.

df.first_name and df.last_name are both strings, and so are df2.firstname and df2.lastname.

Error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Thanks in advance

Upvotes: 0

Views: 131

Answers (1)

Azhar Khan

Reputation: 4108

There are several issues in your statement:

  • In df.withColumn(), you cannot use columns from both df and df2 in one statement. First join the two DataFrames using df.join(df2, on="some_key", how=...), with how set to "left", "right", or "full" as needed.
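  • Enclose each comparison in the "when" condition in round brackets: (df.first_name == df2.firstname) & (df.last_name == df2.lastname). In Python, & binds more tightly than ==, so without the brackets the expression becomes a chained comparison that forces a Column into a bool, which raises the error you see.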
  • The string literals of "when" and "otherwise" can be wrapped in lit() to make them explicit column expressions, like lit("1. Match on First and Last Name") and lit("No Match"), although both functions also accept plain Python values.
  • There is possibly a typo in your field name: df.last_namne should probably be df.last_name.

Upvotes: 1
