Reputation: 11
columnToDelete=[empDFTems2.name,empDFTems.gender]
listjoin = empDFTems.join(empDFTems2, (empDFTems["emp_id"]==empDFTems2["emp_id"]), "left").drop(*columnToDelete)
It wont take the list since its a dataframe name. How can i programmatically drop all columns for the given dataframe after the join, reading which column to drop from the list
Upvotes: 1
Views: 1272
Reputation: 11
According to PySpark doc you can only pass list of col names as strings. if you want to pass a column object only a single value is taken .
drop(*cols)
Returns a new DataFrame that drops the specified column. This is a no-op if schema doesn’t contain the given column name(s).
Parameters: cols – a string name of the column to drop, or a Column to drop, or a list of string name of the columns to drop.
I would suggest to select the required columns rather dropping the unnecessary ones. select accepts list of column objects so it should work fine.
empDFTems.show()
+------+-------+------+---+-----+
|emp_id| name|gender|age| dept|
+------+-------+------+---+-----+
| 1| alis R| F| 34| IT|
| 2|Robin M| M| 44|Sales|
+------+-------+------+---+-----+
empDFTems2.show()
+------+-----+------+-------+------+
|emp_id| name|gender|country|active|
+------+-----+------+-------+------+
| 1| alis|Female| 34| IT|
| 2|Robin| Male| 44| Sales|
+------+-----+------+-------+------+
#columnToDelete=[empDFTems2.name,empDFTems.gender]
#lets take delete columns separately if possible
empDFTems2columnToDelete = ["name"]
empDFTemscolumnToDelete = ["gender"]
selectCols = [empDFTems[i] for i in empDFTems.columns if i not in empDFTemscolumnToDelete] + [empDFTems2[i] for i in empDFTems2.columns if i not in empDFTemscolumnToDelete]
listjoin = empDFTems.join(empDFTems2, (empDFTems["emp_id"]==empDFTems2["emp_id"]), "left").select(selectCols)
listjoin.show()
+------+-------+---+-----+------+------+-------+------+
|emp_id| name|age| dept|emp_id|gender|country|active|
+------+-------+---+-----+------+------+-------+------+
| 1| alis R| 34| IT| 1|Female| 34| IT|
| 2|Robin M| 44|Sales| 2| Male| 44| Sales|
+------+-------+---+-----+------+------+-------+------+
Alternatively if you must use drop, then the only option is looping over the columns
columnToDelete=[empDFTems2.name,empDFTems.gender]
listjoin = empDFTems.join(empDFTems2, (empDFTems["emp_id"]==empDFTems2["emp_id"]), "left")
for i in columnToDelete:
listjoin = listjoin.drop(i)
listjoin.show()
+------+-------+---+-----+------+------+-------+------+
|emp_id| name|age| dept|emp_id|gender|country|active|
+------+-------+---+-----+------+------+-------+------+
| 1| alis R| 34| IT| 1|Female| 34| IT|
| 2|Robin M| 44|Sales| 2| Male| 44| Sales|
+------+-------+---+-----+------+------+-------+------+
Upvotes: 1