Reputation: 39
I am trying to add a new column to an existing spark df. If I specify the df column name as the new value for the new column than it works, but since i want the value column to be dynamic based on configs i want to pass the value from a variable.
e.g:
>>> df1.printSchema()
root
|-- COL_A: string (nullable = true)
|-- COL_B: string (nullable = true)
|-- COL_C: string (nullable = true)
if I use df2 = df1.withColumn("COL_D", lit(df1.COL_A))
then it works as expected.
However if i have variable and try to pass that than it does not work.
val_col = "COL_B"
df2 = df1.withColumn("COL_D", lit(df1.val_col))
I am not sure if this is even possible, but wanted to ask .Let me know if anyone has done similar thing before.
Upvotes: 1
Views: 5680
Reputation: 7587
Use col
function to avoid this issue.
df = sqlContext.createDataFrame([(1,'Björn'),(2,'Oliver'),(3,'Müller')],['ID','Name'])
df.show()
+---+------+
| ID| Name|
+---+------+
| 1| Björn|
| 2|Oliver|
| 3|Müller|
+---+------+
df1 = df.withColumn('New_ID',lit(df.ID))
df1.show()
+---+------+------+
| ID| Name|New_ID|
+---+------+------+
| 1| Björn| 1|
| 2|Oliver| 2|
| 3|Müller| 3|
+---+------+------+
So far so good. But, the moment we assign a column name to a variable, we get an error, as demonstrated below -
val_col = "ID"
df1 = df.withColumn('New_ID',lit(df.val_col))
AttributeErrorTraceback (most recent call last)
<ipython-input-48-1bb287cfa9f2> in <module>
5
6 val_col = "ID"
----> 7 df1 = df.withColumn('New_ID',lit(df.val_col))
8
9 from pyspark.sql.functions import col
/opt/mapr/spark/spark-2.2.1/python/pyspark/sql/dataframe.py in __getattr__(self, name)
1018 if name not in self.columns:
1019 raise AttributeError(
-> 1020 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
1021 jc = self._jdf.apply(name)
1022 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'val_col'
You get this error because there is no variable named val_col
, and Python assumes what follows after the dot as a column name. It doesn't take string per-se.
Solution: You can avoid this issue all together by importing col
function and using it to do your operations.
from pyspark.sql.functions import col
val_col = "ID"
df1 = df.withColumn('New_ID',lit(col(val_col)))
df1.show()
+---+------+------+
| ID| Name|New_ID|
+---+------+------+
| 1| Björn| 1|
| 2|Oliver| 2|
| 3|Müller| 3|
+---+------+------+
Upvotes: 1