Reputation: 969
I have a dataframe whose columns come from a source file that is not consistent; new columns can be added or removed with each load.
I created a list of the required columns, and I'm trying to add the columns that are missing from the dataframe by checking against that list:
req_cols = ["A","B","C","D","E","F","G"]
df.show()
#+---+----+---+---+---+
#| A | B  | C | D | E |
#+---+----+---+---+---+
#| 5 | 10 | 8 | 9 | 0 |
#+---+----+---+---+---+
I now check whether each column exists in the dataframe and, if not, add it:
for cols in req_cols:
    if cols not in df.columns:
        df = df.withColumns(cols, lit(None))
I'm facing an error which says cols should be a string or a valid Spark column. What am I doing wrong? Also, does my dataframe keep getting overwritten on every iteration? What alternative solution can I use?
My required output after adding the two missing columns:
#+---+----+---+---+---+---+---+
#| A | B  | C | D | E | F | G |
#+---+----+---+---+---+---+---+
#| 5 | 10 | 8 | 9 | 0 |   |   |
#+---+----+---+---+---+---+---+
Upvotes: 0
Views: 2718
Reputation: 2946
It should be df.withColumn, without the "s".
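Side note: newer Spark versions (3.3+) do have a DataFrame.withColumns, but it expects a single dict mapping column names to Columns rather than separate name and column arguments, so the call in the question would not work with it either. A minimal sketch of that variant, assuming Spark 3.3 or later:
from pyspark.sql import functions as F
# Map every missing required column to a null literal, then add them all in one call.
missing = {c: F.lit(None) for c in req_cols if c not in df.columns}
df = df.withColumns(missing)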
The following works for me:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    {"A": 5, "B": 10, "C": 8, "D": 9, "E": 0},
]
df = spark.createDataFrame(data)

req_cols = ["A", "B", "C", "D", "E", "F", "G"]

# Add each required column that is missing as a null literal.
for col in req_cols:
    if col not in df.columns:
        df = df.withColumn(col, F.lit(None))
Result:
+---+---+---+---+---+----+----+
|A |B |C |D |E |F |G |
+---+---+---+---+---+----+----+
|5 |10 |8 |9 |0 |null|null|
+---+---+---+---+---+----+----+
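On the other questions: reassigning df in the loop is fine, since withColumn returns a new dataframe rather than modifying the old one in place. And as an alternative to the loop, all the missing columns can be added in a single select. A minimal sketch, assuming the same df and req_cols as above; the cast to "string" is an assumption, used to avoid untyped NullType columns:
# Keep existing columns as-is; fill each missing one with a typed null literal.
df = df.select(
    [F.col(c) if c in df.columns else F.lit(None).cast("string").alias(c) for c in req_cols]
)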
Upvotes: 2