Reputation: 776
I need to merge all the values of each of the dataframe's columns into a single value per column. So the columns stay intact, but I am just summing all the respective values. For this purpose I intend to use this function:
import pyspark.sql.functions as f

def sum_col(data, col):
    return data.select(f.sum(col)).collect()[0][0]
I was now thinking of doing something like this:
data = data.map(lambda current_col: sum_col(data, current_col))
Is this doable, or do I need another way to merge all the values of the columns?
Upvotes: 3
Views: 217
Reputation: 5526
You can achieve this with the sum function:
import pyspark.sql.functions as f
df.select(*[f.sum(cols).alias(cols) for cols in df.columns]).show()
+----+---+---+
|val1| x| y|
+----+---+---+
| 36| 29|159|
+----+---+---+
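If you need the per-column totals as plain Python values rather than a one-row DataFrame, here is a minimal sketch building on the same df (the dict values shown are just the ones from the output above):

import pyspark.sql.functions as f

# collect the single row of per-column totals into a plain Python dict
sums = df.select(*[f.sum(c).alias(c) for c in df.columns]).first().asDict()
# e.g. {'val1': 36, 'x': 29, 'y': 159}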
Upvotes: 2
Reputation: 1712
To sum all your columns into a new column, you can use a list comprehension with Python's built-in sum function:
import pyspark.sql.functions as F

tst = sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)], schema=['val1','x','y'])
tst_sum = tst.withColumn("sum_col", sum([tst[coln] for coln in tst.columns]))
Results:
tst_sum.show()
+----+---+---+-------+
|val1| x| y|sum_col|
+----+---+---+-------+
| 10| 7| 14| 31|
| 5| 1| 4| 10|
| 9| 8| 10| 27|
| 2| 6| 90| 98|
| 7| 2| 30| 39|
| 3| 5| 11| 19|
+----+---+---+-------+
Note: if you have imported the sum function from PySpark as from pyspark.sql.functions import sum, it shadows Python's built-in sum, so you have to import it under a different name, for example from pyspark.sql.functions import sum as sum_pyspark.
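For example, a minimal sketch of the aliased import alongside Python's built-in sum (assuming the same tst dataframe as above):

from pyspark.sql.functions import sum as sum_pyspark  # aliased so it does not shadow Python's built-in sum

# per-column totals with the PySpark aggregate
tst.select(*[sum_pyspark(c).alias(c) for c in tst.columns]).show()

# per-row totals still use Python's built-in sum over the column expressions
tst.withColumn("sum_col", sum([tst[c] for c in tst.columns])).show()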
Upvotes: 1