Reputation: 429
I have 2 spark dataframes, and I want to add new column named "seg" to dataframe df2 based on below condition
I tried below operation in pyspark but its throwing exception.
cc002 = df2.withColumn('seg',F.when(df2.colx == df1.colx,"True").otherwise("FALSE"))
id colx coly
1 678 56789
2 900 67890
3 789 67854
Name colx
seema 900
yash 678
deep 800
harsh 900
Name colx seg
seema 900 True
harsh 900 True
yash 678 True
deep 800 False
Please help me correcting the given pyspark code or suggest the better way of doing it.
Upvotes: 0
Views: 802
Reputation: 1464
If I understand your question correctly what you want to do is this
res = df2.join(
df1,
on="colx",
how = "left"
).select(
"Name",
"colx"
).withColumn(
"seg",
F.when(F.col(colx).isNull(),F.lit(True)).otherwise(F.lit(False))
)
let me know if this is the solution you want.
my bad i did write the incorrect code in hurry below is the corrected one
import pyspark.sql.functions as F
df1 = sqlContext.createDataFrame([[1,678,56789],[2,900,67890],[3,789,67854]],['id', 'colx', 'coly'])
df2 = sqlContext.createDataFrame([["seema",900],["yash",678],["deep",800],["harsh",900]],['Name', 'colx'])
res = df2.join(
df1.withColumn(
"check",
F.lit(1)
),
on="colx",
how = "left"
).withColumn(
"seg",
F.when(F.col("check").isNotNull(),F.lit(True)).otherwise(F.lit(False))
).select(
"Name",
"colx",
"seg"
)
res.show()
+-----+----+-----+
| Name|colx| seg|
+-----+----+-----+
| yash| 678| true|
|seema| 900| true|
|harsh| 900| true|
| deep| 800|false|
+-----+----+-----+
Upvotes: 1
Reputation: 19885
You can join on colx
and fill null values with False
:
result = (df2.join(df1.select(df1['colx'], F.lit(True).alias('seg')),
on='colx',
how='left')
.fillna(False, subset='seg'))
result.show()
Output:
+----+-----+-----+
|colx| Name| seg|
+----+-----+-----+
| 900|seema| true|
| 900|harsh| true|
| 800| deep|false|
| 678| yash| true|
+----+-----+-----+
Upvotes: 1