Reputation: 101
I tried to follow this answer but my question is slightly different.
I have two pyspark data frames, df2 and bears2. Both have an integer column, and I want to create a boolean column like this pseudocode:
df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))
Basically, if the value of df2.week_id exists in the corresponding column of bears2, I want a 1, else a 0.
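For example, with some made-up data of the same shape (assuming an existing SparkSession named spark), I want something like:
# hypothetical data with the same shape as mine
df2 = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["week_id"])
bears2 = spark.createDataFrame([(1,), (3,)], ["week_if"])
# desired result: game = 1 for week_id 1 and 3 (present in bears2), game = 0 for week_id 2 and 4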
I tried the expr() approach from the other question, but wasn't able to get it to work. It looked like this:
new_column_1 = F.expr(
    """IF(df2.week_id IN(bears2.week_if), 1, 0))"""
)
Upvotes: 7
Views: 11407
Reputation: 43544
You can't currently use IN like that with pyspark-sql. Instead, you will have to join the DataFrames.
Try something like:
from pyspark.sql.functions import col, when

# add a common key column "id" to both sides, alias the DataFrames so their columns
# can still be told apart, LEFT join on the key, and flag rows that found a match
df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))
For a join on a column name, the key column has to exist under the same name in both DataFrames. For that reason, I first call withColumn("id", ...) on each side to add a copy of the key column under the common name "id" for the join.
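If you prefer an actual rename over an extra copy of the key column, withColumnRenamed() should work just as well (df2_keyed and bears2_keyed are just illustrative names):
# rename the keys instead of copying them -- week_id/week_if are then replaced by "id"
df2_keyed = df2.withColumnRenamed("week_id", "id").alias("df2")
bears2_keyed = bears2.withColumnRenamed("week_if", "id").alias("bears2")
# then join df2_keyed and bears2_keyed on "id" exactly as above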
Next we do a LEFT join to keep all of the rows in df2. Finally we select all of the columns in df2 and use pyspark.sql.functions.when() to create the boolean column (I've given it the name game with alias() to match your pseudocode).
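To make this concrete, here is a sketch with some toy data (assuming an existing SparkSession named spark and the column names from your question):
from pyspark.sql.functions import col, when

df2 = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["week_id"])
bears2 = spark.createDataFrame([(1,), (3,)], ["week_if"])

df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))

df3.orderBy("week_id").show()
# week_id 1 and 3 exist in bears2, so game is 1 for them and 0 for weeks 2 and 4.
# The helper "id" column is still in the output; drop it with df3.drop("id") if you don't want it.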
The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.
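As a minimal standalone illustration of when()/otherwise() (toy example, not from your data):
from pyspark.sql.functions import col, when

df = spark.createDataFrame([(1,), (None,)], "x int")
df.select("x", when(col("x").isNotNull(), 1).otherwise(0).alias("flag")).show()
# x = 1    -> flag = 1  (the condition is True, so the second argument of when() is returned)
# x = null -> flag = 0  (the condition is not True, so the otherwise() value is used)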
Upvotes: 6