RData

Reputation: 969

PySpark: adding columns from a list to a dataframe when they are not already present

I have a dataframe whose columns come from a source file that is not consistent; new columns can be added or removed with each load.

I created a list of the required columns, and I'm trying to add any columns that are missing from the dataframe by checking against that list:

req_cols = ["A","B","C","D","E","F","G"]
df.show()
#+---+---+---+---+---+
#|  A|  B|  C|  D|  E|
#+---+---+---+---+---+
#|  5| 10|  8|  9|  0|
#+---+---+---+---+---+

I now check whether each column exists in the dataframe and, if not, I plan to add it:

for cols in req_cols:
    if cols not in df.columns:
        df = df.withColumns(cols, lit(None))

I'm facing an error which says cols should be a string or a valid Spark column. What am I doing wrong? Also, does my dataframe keep getting overwritten on every iteration? What alternate solution can I use?

My required output after adding the 2 missing columns:

#+---+---+---+---+---+----+----+
#|  A|  B|  C|  D|  E|   F|   G|
#+---+---+---+---+---+----+----+
#|  5| 10|  8|  9|  0|    |    |
#+---+---+---+---+---+----+----+

Upvotes: 0

Views: 2718

Answers (1)

vladsiv

Reputation: 2946

It should be df.withColumn, without the s.

The following works for me:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"A": 5, "B": 10, "C": 8, "D": 9, "E": 0},
]

df = spark.createDataFrame(data)
req_cols = ["A", "B", "C", "D", "E", "F", "G"]
for col in req_cols:
    # add any required column that is missing, filled with nulls
    if col not in df.columns:
        df = df.withColumn(col, F.lit(None))

Result:

+---+---+---+---+---+----+----+                                                 
|A  |B  |C  |D  |E  |F   |G   |
+---+---+---+---+---+----+----+
|5  |10 |8  |9  |0  |null|null|
+---+---+---+---+---+----+----+
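
Two side notes, in case they help. F.lit(None) gives the new columns the NullType data type, so if the downstream schema matters you can cast the literal to a concrete type. Also, on Spark 3.3+ there is a withColumns method (plural), but it expects a dict mapping column names to Column expressions rather than a (name, column) pair, which is why the original call fails. A minimal sketch combining both, assuming Spark 3.3+ and that string columns are acceptable:

# withColumns (Spark 3.3+) takes a dict of column name -> Column expression.
# Casting the null literal gives the added columns a concrete type instead of NullType.
missing = {c: F.lit(None).cast("string") for c in req_cols if c not in df.columns}
df = df.withColumns(missing)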

Upvotes: 2
