D556
D556

Reputation: 11

Pyspark delete multiple columns after join Programmatically

columnToDelete=[empDFTems2.name,empDFTems.gender]

listjoin = empDFTems.join(empDFTems2, (empDFTems["emp_id"]==empDFTems2["emp_id"]), "left").drop(*columnToDelete)

It wont take the list since its a dataframe name. How can i programmatically drop all columns for the given dataframe after the join, reading which column to drop from the list

Upvotes: 1

Views: 1272

Answers (1)

Dipayan
Dipayan

Reputation: 11

According to PySpark doc you can only pass list of col names as strings. if you want to pass a column object only a single value is taken .

drop(*cols)

Returns a new DataFrame that drops the specified column. This is a no-op if schema doesn’t contain the given column name(s).

Parameters: cols – a string name of the column to drop, or a Column to drop, or a list of string name of the columns to drop.

I would suggest to select the required columns rather dropping the unnecessary ones. select accepts list of column objects so it should work fine.

empDFTems.show()
+------+-------+------+---+-----+
|emp_id|   name|gender|age| dept|
+------+-------+------+---+-----+
|     1| alis R|     F| 34|   IT|
|     2|Robin M|     M| 44|Sales|
+------+-------+------+---+-----+

empDFTems2.show()
+------+-----+------+-------+------+
|emp_id| name|gender|country|active|
+------+-----+------+-------+------+
|     1| alis|Female|     34|    IT|
|     2|Robin|  Male|     44| Sales|
+------+-----+------+-------+------+

#columnToDelete=[empDFTems2.name,empDFTems.gender]
#lets take delete columns separately if possible 
empDFTems2columnToDelete = ["name"]
empDFTemscolumnToDelete = ["gender"]

selectCols = [empDFTems[i] for i in empDFTems.columns if i not in empDFTemscolumnToDelete] + [empDFTems2[i] for i in empDFTems2.columns if i not in empDFTemscolumnToDelete]

listjoin = empDFTems.join(empDFTems2, (empDFTems["emp_id"]==empDFTems2["emp_id"]), "left").select(selectCols)

listjoin.show()

+------+-------+---+-----+------+------+-------+------+
|emp_id|   name|age| dept|emp_id|gender|country|active|
+------+-------+---+-----+------+------+-------+------+
|     1| alis R| 34|   IT|     1|Female|     34|    IT|
|     2|Robin M| 44|Sales|     2|  Male|     44| Sales|
+------+-------+---+-----+------+------+-------+------+

Alternatively if you must use drop, then the only option is looping over the columns

columnToDelete=[empDFTems2.name,empDFTems.gender]
listjoin = empDFTems.join(empDFTems2, (empDFTems["emp_id"]==empDFTems2["emp_id"]), "left")
for i in columnToDelete:
    listjoin = listjoin.drop(i)
    
listjoin.show()
+------+-------+---+-----+------+------+-------+------+
|emp_id|   name|age| dept|emp_id|gender|country|active|
+------+-------+---+-----+------+------+-------+------+
|     1| alis R| 34|   IT|     1|Female|     34|    IT|
|     2|Robin M| 44|Sales|     2|  Male|     44| Sales|
+------+-------+---+-----+------+------+-------+------+

Upvotes: 1

Related Questions