Reputation: 421
Here is a list of values I would like my dataframe to have as columns:
cols=['USA','CAN','UK','DEN']
My current df:
| ID | USA | DEN | VEN | NOR |
|----|-----|-----|-----|-----|
| 98 | 1   | 0   | 1   | 1   |
| 99 | 0   | 1   | 0   | 0   |
I want to check whether my existing df has all the values in the list as columns; if not, create those columns and fill them with 0, like:
| ID | USA | DEN | VEN | NOR | CAN | UK |
|----|-----|-----|-----|-----|-----|----|
| 98 | 1   | 0   | 1   | 1   | 0   | 0  |
| 99 | 0   | 1   | 0   | 0   | 0   | 0  |
Upvotes: 1
Views: 1168
Reputation: 32720
You can use a simple select expression: keep all existing columns, and append a literal-0 column for each name in the list that is missing.
from pyspark.sql.functions import lit
select_cols = df.columns + [lit(0).alias(c) for c in cols if c not in df.columns]
df.select(*select_cols).show()
Upvotes: 1
Reputation: 31540
Try a for + if loop: check whether each column exists in df.columns, and if not, add the column filled with 0.
from pyspark.sql.functions import lit

df = spark.createDataFrame([(98, 1, 0, 1, 1,)], ['ID', 'USA', 'DEN', 'VEN', 'NOR'])
cols = ['USA', 'CAN', 'UK', 'DEN']

# Add each missing column as an integer literal 0.
for i in cols:
    if i not in df.columns:
        df = df.withColumn(i, lit(0))

df.show()
#+---+---+---+---+---+---+---+
#| ID|USA|DEN|VEN|NOR|CAN| UK|
#+---+---+---+---+---+---+---+
#| 98| 1| 0| 1| 1| 0| 0|
#+---+---+---+---+---+---+---+
Upvotes: 2