Faliha Zikra

Reputation: 421

Create columns in pyspark df from a list if the column doesn't already exist

Here is a list of the columns I would like my dataframe to have:

 cols=['USA','CAN','UK','DEN']

My current df:

| ID | USA | DEN | VEN | NOR|
| 98 |  1  |  0  | 1   |  1 |
| 99 |  0  |  1  | 0   |  0 |

I want to check whether my existing df has every value in the list as a column and, if not, create the missing columns and fill them with 0, like this:

| ID | USA | DEN | VEN | NOR| CAN | UK|
| 98 |  1  |  0  | 1   |  1 |  0  | 0 |
| 99 |  0  |  1  | 0   |  0 |  0  | 0 |

Upvotes: 1

Views: 1168

Answers (2)

blackbishop

Reputation: 32720

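You can use a simple select expression that keeps the existing columns and appends a literal 0 column for each name in the list that is missing: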

from pyspark.sql.functions import lit

select_cols = df.columns + [lit(0).alias(c) for c in cols if c not in df.columns]

df.select(*select_cols).show()
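If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the names `missing_columns` and `add_missing_columns` and the `fill` parameter are made up for illustration and are not part of PySpark. The name check is plain Python, so it is case-sensitive and also skips duplicates in the wanted list.

```python
def missing_columns(existing, wanted):
    """Return the names in `wanted` that are not in `existing`,
    in their original order and without duplicates."""
    present = set(existing)
    missing = []
    for name in wanted:
        if name not in present:
            missing.append(name)
            present.add(name)  # guards against duplicates in `wanted`
    return missing


def add_missing_columns(df, wanted, fill=0):
    """Hypothetical helper: append every missing column in a single select,
    filled with the literal `fill` (integer 0 by default)."""
    # Imported here so the name check above can be used without Spark installed.
    from pyspark.sql.functions import lit

    new_cols = [lit(fill).alias(c) for c in missing_columns(df.columns, wanted)]
    return df.select("*", *new_cols)


# Column names taken from the question.
existing = ['ID', 'USA', 'DEN', 'VEN', 'NOR']
cols = ['USA', 'CAN', 'UK', 'DEN']
print(missing_columns(existing, cols))
```

With the question's dataframe, `add_missing_columns(df, cols).show()` should append CAN and UK after the existing columns.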

Upvotes: 1

notNull

Reputation: 31540

Try a for loop with an if check: for each name in the list, test whether it is already in df.columns and, if it is not, add the column filled with 0.

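from pyspark.sql.functions import lit

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame([(98, 1, 0, 1, 1)], ['ID', 'USA', 'DEN', 'VEN', 'NOR'])
cols = ['USA', 'CAN', 'UK', 'DEN']

# Add each missing column as an integer 0 (lit("0") would create a string column).
for c in cols:
    if c not in df.columns:
        df = df.withColumn(c, lit(0))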

df.show()

#+---+---+---+---+---+---+---+
#| ID|USA|DEN|VEN|NOR|CAN| UK|
#+---+---+---+---+---+---+---+
#| 98|  1|  0|  1|  1|  0|  0|
#+---+---+---+---+---+---+---+

Upvotes: 2
