Add columns on a Pyspark Dataframe

Question

I have a Pyspark Dataframe with this structure:

+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+

I originally had two DataFrames and outer-joined them using user as the key, so there can also be null values. I can't find a way to sum the columns that share a name in order to get a DataFrame like this:

+----+----+----+
|user| A/B|   C|
+----+----+----+
|   1|   1|   3|
|   2|   4|   2|
+----+----+----+

Also note that there could be many such duplicated columns, so selecting each column by name is not an option. In pandas this was possible by using "user" as the index and then adding the two DataFrames. How can I do this in Spark?

Shivansh · Accepted Answer

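I have a workaround for this. First, rename every column of one DataFrame except the join key, so that the names no longer clash after the join: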

// Append "_1" to every column name of df1 except the join key "user"
val dataFrameOneColumns = df1.columns.map(a => if (a == "user") a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)

Now do the join using updatedDF; the output will contain the values under different column names.

Then build the list of tuples of the columns to be combined:

val newlist = df1.columns.filterNot(_ == "user").zip(dataFrameOneColumns.filterNot(_ == "user"))

And then combine the values of the columns within each tuple to get the desired output!

PS: I am guessing you can write the logic for combining, so I am not spoon-feeding!
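
For anyone who wants the combining step spelled out anyway, here is a minimal PySpark sketch of the whole approach, since the question is about PySpark rather than Scala. It is only one way to do it and rests on a few assumptions that are not in the original post: both DataFrames have the same non-key columns, a null produced by the outer join should count as 0, and the helper names (suffix_columns, column_pairs, sum_matching_columns) are made up for illustration.

```python
def suffix_columns(columns, key="user", suffix="_1"):
    """Append a suffix to every column name except the join key."""
    return [c if c == key else c + suffix for c in columns]


def column_pairs(columns, key="user", suffix="_1"):
    """Pair each non-key column with its suffixed counterpart."""
    return [(c, c + suffix) for c in columns if c != key]


def sum_matching_columns(df1, df2, key="user", suffix="_1"):
    """Outer-join df1 and df2 on the key and sum the same-named columns.

    Assumes df1 and df2 share the same non-key columns (hypothetical helper).
    """
    # Imported here so the two name helpers above can be used without Spark.
    from pyspark.sql import functions as F

    # Step 1: rename the non-key columns of df1 so nothing clashes after the join.
    renamed = df1.toDF(*suffix_columns(df1.columns, key, suffix))

    # Step 2: outer join on the key; joining on a column name keeps a single key column.
    joined = df2.join(renamed, on=key, how="outer")

    # Step 3: add each (original, suffixed) pair, treating nulls as 0 (an assumption).
    # Backticks guard names with special characters such as "A/B".
    summed = [
        (
            F.coalesce(F.col(f"`{a}`"), F.lit(0))
            + F.coalesce(F.col(f"`{b}`"), F.lit(0))
        ).alias(a)
        for a, b in column_pairs(df1.columns, key, suffix)
    ]
    return joined.select(key, *summed)
```

Calling sum_matching_columns(df1, df2).show() on the two original DataFrames should give one A/B column and one C column per user. Note that with the coalesce to 0, a value missing on both sides comes out as 0 rather than null; drop the coalesce if you would rather keep the null.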
