marlanbar

Reputation: 167

Add columns on a Pyspark Dataframe

I have a Pyspark Dataframe with this structure:

+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+

I originally had two dataframes, but I outer-joined them using user as the key, so there may also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:

+----+----+----+
|user| A/B|   C|
+----+----+----+
|   1|   1|   3|
|   2|   4|   2|
+----+----+----+

Also note that there could be many duplicated columns, so selecting each column literally is not an option. In pandas this was possible by using "user" as the index and then adding both dataframes. How can I do this in Spark?

Upvotes: 1

Views: 353

Answers (1)

Shivansh

Reputation: 3544

I have a workaround for this:

// Rename every column of df1 except the join key by appending "_1"
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)

Now perform the join; the output will contain the two sets of values under different names.
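For example, assuming `df2` is the second dataframe (its name is not given in the question):

```scala
// Outer join on the shared key; df1's value columns now carry the "_1" suffix,
// so no names collide with df2's columns
val joinedDF = updatedDF.join(df2, Seq("user"), "outer")
```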

Then build the list of tuples pairing the columns to be combined:

// Pair each original column name with its "_1" counterpart, skipping the key
val newlist = df1.columns.filterNot(_.equals("user"))
  .zip(dataFrameOneColumns.filterNot(_.equals("user")))

Then combine the values of the columns within each tuple and you get the desired output!
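A minimal sketch of that combining step, assuming `joinedDF` is the result of the outer join above (the name is illustrative) and treating the nulls produced by the outer join as 0:

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit}

// For each (original, renamed) pair, replace the original column with the
// sum of the two, coalescing nulls to 0, then drop the renamed duplicate
val combinedDF = newlist.foldLeft(joinedDF) { case (df, (orig, renamed)) =>
  df.withColumn(orig, coalesce(col(orig), lit(0)) + coalesce(col(renamed), lit(0)))
    .drop(renamed)
}
```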

PS: I am guessing you can write the combining logic yourself, so I am not spoon-feeding!

Upvotes: 1
