Tilo
Tilo

Reputation: 429

New column creation based on if and else condition using pyspark

I have 2 spark dataframes, and I want to add new column named "seg" to dataframe df2 based on below condition

I tried below operation in pyspark but its throwing exception.

cc002 = df2.withColumn('seg',F.when(df2.colx == df1.colx,"True").otherwise("FALSE"))
id  colx  coly
1   678   56789
2   900   67890
3   789   67854
Name   colx
seema  900
yash   678
deep   800
harsh  900
Name  colx   seg
seema 900    True
harsh 900    True
yash  678    True
deep  800    False

Please help me correcting the given pyspark code or suggest the better way of doing it.

Upvotes: 0

Views: 802

Answers (2)

Ankit Kumar Namdeo
Ankit Kumar Namdeo

Reputation: 1464

If I understand your question correctly what you want to do is this

res = df2.join(
    df1,
    on="colx",
    how = "left"
).select(
    "Name",
    "colx"
).withColumn(
    "seg",
    F.when(F.col(colx).isNull(),F.lit(True)).otherwise(F.lit(False))
)

let me know if this is the solution you want.

my bad i did write the incorrect code in hurry below is the corrected one

import pyspark.sql.functions as F

df1 = sqlContext.createDataFrame([[1,678,56789],[2,900,67890],[3,789,67854]],['id', 'colx', 'coly'])

df2 = sqlContext.createDataFrame([["seema",900],["yash",678],["deep",800],["harsh",900]],['Name', 'colx'])

res = df2.join(
    df1.withColumn(
        "check",
        F.lit(1)
    ),
    on="colx",
    how = "left"
).withColumn(
    "seg",
    F.when(F.col("check").isNotNull(),F.lit(True)).otherwise(F.lit(False))
).select(
    "Name",
    "colx",
    "seg"
)

res.show()

+-----+----+-----+
| Name|colx|  seg|
+-----+----+-----+
| yash| 678| true|
|seema| 900| true|
|harsh| 900| true|
| deep| 800|false|
+-----+----+-----+

Upvotes: 1

gmds
gmds

Reputation: 19885

You can join on colx and fill null values with False:

result = (df2.join(df1.select(df1['colx'], F.lit(True).alias('seg')), 
                   on='colx', 
                   how='left')
             .fillna(False, subset='seg'))

result.show()

Output:

+----+-----+-----+
|colx| Name|  seg|
+----+-----+-----+
| 900|seema| true|
| 900|harsh| true|
| 800| deep|false|
| 678| yash| true|
+----+-----+-----+

Upvotes: 1

Related Questions