RData

Reputation: 969

PySpark: adding columns from a list to a dataframe when they are not already present

I have a dataframe whose columns come from a source file that is not consistent; new columns can be added or removed with each load.

I created a list of the required columns, and I'm trying to add any columns that are missing from the dataframe by checking against that list:

req_cols = ["A","B","C","D","E","F","G"]
df.show()
#+---+---+---+---+---+
#|  A|  B|  C|  D|  E|
#+---+---+---+---+---+
#|  5| 10|  8|  9|  0|
#+---+---+---+---+---+

I now check whether each column exists in the dataframe and, if not, I plan to add it:

for cols in req_cols:
    if cols not in df.columns:
        df = df.withColumns(cols, lit(None))

I'm facing an error which says cols should be a string or a valid Spark column. What am I doing wrong? Also, does my dataframe keep getting overwritten on every iteration? What alternate solution can I use?

My required output after adding the 2 missing columns:

#+---+---+---+---+---+----+----+
#|  A|  B|  C|  D|  E|   F|   G|
#+---+---+---+---+---+----+----+
#|  5| 10|  8|  9|  0|    |    |
#+---+---+---+---+---+----+----+

Upvotes: 0

Views: 2718

Answers (1)

vladsiv

Reputation: 2946

It should be df.withColumn, without the s.

The following works for me:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"A": 5, "B": 10, "C": 8, "D": 9, "E": 0},
]

df = spark.createDataFrame(data)
req_cols = ["A", "B", "C", "D", "E", "F", "G"]
for col in req_cols:
    # add any required column that is missing, filled with nulls
    if col not in df.columns:
        df = df.withColumn(col, F.lit(None))

Result:

+---+---+---+---+---+----+----+                                                 
|A  |B  |C  |D  |E  |F   |G   |
+---+---+---+---+---+----+----+
|5  |10 |8  |9  |0  |null|null|
+---+---+---+---+---+----+----+
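
Two side notes, in case they help. F.lit(None) gives the new columns the NullType data type, so if the downstream schema matters you can cast the literal to a concrete type. Also, on Spark 3.3+ there is a withColumns method (plural), but it expects a dict mapping column names to Column expressions rather than a (name, column) pair, which is why the original call fails. A minimal sketch combining both, assuming Spark 3.3+ and that string columns are acceptable:

# withColumns (Spark 3.3+) takes a dict of column name -> Column expression.
# Casting the null literal gives the added columns a concrete type instead of NullType.
missing = {c: F.lit(None).cast("string") for c in req_cols if c not in df.columns}
df = df.withColumns(missing)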

Upvotes: 2
