mlewis

Reputation: 101

Pyspark sql: Create a new column based on whether a value exists in a different DataFrame's column

I tried to follow this answer but my question is slightly different.

I have two PySpark DataFrames, df2 and bears2. Both have an integer column, and I want to create a boolean flag like this pseudocode:

df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))

Basically, if the value of df2 exists in the corresponding column of bears2, I want a 1 else a 0
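For concreteness, the two DataFrames might look something like this (illustrative sample data only, assuming an existing SparkSession named spark):

# illustrative example data; the real DataFrames come from elsewhere
df2 = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["week_id"])
bears2 = spark.createDataFrame([(2,), (4,)], ["week_if"])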

I tried the expr() from the other question, but wasn't able to get it to work. It looked like this:

new_column_1 = F.expr(
    """IF(df2.week_id IN (bears2.week_if), 1, 0)"""
    )

Upvotes: 7

Views: 11407

Answers (1)

pault

Reputation: 43544

You can't currently use IN like that with pyspark-sql. Instead, you will have to join the DataFrames.

Try something like:

from pyspark.sql.functions import col, when

# add a shared join key "id" to both DataFrames, left join, then build the flag
df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))

For the join to work, the join key column has to exist with the same name in both DataFrames. For that reason, I first call withColumn("id", ...) to add an "id" column to each DataFrame before joining.
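If you'd rather not add the extra column, joining on an explicit equality condition should work the same way. A sketch, assuming the column names week_id and week_if from your question:

from pyspark.sql.functions import col, when

# join directly on equality of the two differently-named columns
df3 = df2.alias("df2")\
    .join(bears2.alias("bears2"),
          on=col("df2.week_id") == col("bears2.week_if"),
          how="left")\
    .select("df2.*", when(col("bears2.week_if").isNotNull(), 1).otherwise(0).alias("game"))

This avoids carrying the helper "id" column into the result.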

Next we do a LEFT join to keep all of the rows of df2 (rows with no match in bears2 get nulls in the bears2 columns). Finally we select all of the columns of df2 and use pyspark.sql.functions.when() to create the boolean column.

The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.
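As a quick standalone illustration of when()/otherwise() (hypothetical toy data, assuming an existing SparkSession named spark):

from pyspark.sql.functions import col, when

# flag is 1 when x is positive, else 0
demo = spark.createDataFrame([(-1,), (0,), (3,)], ["x"])
demo.select("x", when(col("x") > 0, 1).otherwise(0).alias("flag")).show()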

Upvotes: 6
