mlewis

Reputation: 101

Pyspark sql: Create a new column based on whether a value exists in a different DataFrame's column

I tried to follow this answer but my question is slightly different.

I have two PySpark DataFrames, df2 and bears2. Both have an integer column, and I want to create a boolean flag like this pseudocode:

df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))

Basically, if the value of df2 exists in the corresponding column of bears2, I want a 1 else a 0
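For concreteness, the two DataFrames might look something like this (illustrative sample data only, assuming an existing SparkSession named spark):

# illustrative example data; the real DataFrames come from elsewhere
df2 = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["week_id"])
bears2 = spark.createDataFrame([(2,), (4,)], ["week_if"])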

I tried the expr() from the other question, but wasn't able to get it to work. It looked like this:

new_column_1 = F.expr(
    """IF(df2.week_id IN (bears2.week_if), 1, 0)"""
    )

Upvotes: 7

Views: 11407

Answers (1)

pault

Reputation: 43544

You can't currently use IN like that with pyspark-sql. Instead, you will have to join the DataFrames.

Try something like:

from pyspark.sql.functions import col, when

# add a shared join key "id" to both DataFrames, left join, then build the flag
df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))

For the join to work, the join key column has to exist with the same name in both DataFrames. For that reason, I first call withColumn("id", ...) to add an "id" column to each DataFrame before joining.
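If you'd rather not add the extra column, joining on an explicit equality condition should work the same way. A sketch, assuming the column names week_id and week_if from your question:

from pyspark.sql.functions import col, when

# join directly on equality of the two differently-named columns
df3 = df2.alias("df2")\
    .join(bears2.alias("bears2"),
          on=col("df2.week_id") == col("bears2.week_if"),
          how="left")\
    .select("df2.*", when(col("bears2.week_if").isNotNull(), 1).otherwise(0).alias("game"))

This avoids carrying the helper "id" column into the result.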

Next we do a LEFT join to keep all of the rows of df2 (rows with no match in bears2 get nulls in the bears2 columns). Finally we select all of the columns of df2 and use pyspark.sql.functions.when() to create the boolean column.

The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.
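As a quick standalone illustration of when()/otherwise() (hypothetical toy data, assuming an existing SparkSession named spark):

from pyspark.sql.functions import col, when

# flag is 1 when x is positive, else 0
demo = spark.createDataFrame([(-1,), (0,), (3,)], ["x"])
demo.select("x", when(col("x") > 0, 1).otherwise(0).alias("flag")).show()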

Upvotes: 6
