Bendemann

Reputation: 776

Applying map function on dataframe's columns

I need to merge all the values of each of the dataframe's columns into a single value per column, so the columns stay intact but each one is reduced to the sum of its values. For this purpose I intend to use this function:

import pyspark.sql.functions as f

def sum_col(data, col):
    # Aggregate one column and pull the scalar result back to the driver
    return data.select(f.sum(col)).collect()[0][0]

I was now thinking of doing something like this:

data = data.map(lambda current_col: sum_col(data, current_col))

Is this doable, or do I need another way to merge all the values of the columns?

Upvotes: 3

Views: 217

Answers (2)

Shubham Jain

Reputation: 5526

You can achieve this with the sum function:

import pyspark.sql.functions as f

# One f.sum(...) expression per column; alias keeps the original column names
df.select(*[f.sum(col).alias(col) for col in df.columns]).show()

+----+---+---+
|val1|  x|  y|
+----+---+---+
|  36| 29|159|
+----+---+---+
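For reference, a minimal sketch that reproduces the output above, assuming a SparkSession named spark and the same sample data used in the other answer:

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [(10, 7, 14), (5, 1, 4), (9, 8, 10), (2, 6, 90), (7, 2, 30), (3, 5, 11)],
    schema=['val1', 'x', 'y'])

# All per-column sums are computed in a single pass over the data
df.select(*[f.sum(col).alias(col) for col in df.columns]).show()

Because everything happens in one select, Spark evaluates all the aggregates in one job instead of one collect() per column.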

Upvotes: 2

Raghu

Reputation: 1712

To sum all your columns into a new column, you can use a list comprehension with Python's built-in sum function:

import pyspark.sql.functions as F

tst = sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)], schema=['val1','x','y'])
# Python's built-in sum adds the Column objects together, yielding the
# expression tst['val1'] + tst['x'] + tst['y'] as a single new column
tst_sum = tst.withColumn("sum_col", sum([tst[coln] for coln in tst.columns]))

results:

tst_sum.show()
+----+---+---+-------+
|val1|  x|  y|sum_col|
+----+---+---+-------+
|  10|  7| 14|     31|
|   5|  1|  4|     10|
|   9|  8| 10|     27|
|   2|  6| 90|     98|
|   7|  2| 30|     39|
|   3|  5| 11|     19|
+----+---+---+-------+

Note: if you had imported the sum function from pyspark (from pyspark.sql.functions import sum), it would shadow Python's built-in sum, so import it under an alias instead, e.g. from pyspark.sql.functions import sum as sum_pyspark.
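A minimal sketch of that aliasing (the alias name sum_pyspark is just an illustration):

from pyspark.sql.functions import sum as sum_pyspark

# Built-in sum still works for combining Column objects row-wise...
tst_sum = tst.withColumn("sum_col", sum([tst[coln] for coln in tst.columns]))

# ...while the aliased pyspark sum aggregates each column down to one value
tst.select(*[sum_pyspark(coln).alias(coln) for coln in tst.columns]).show()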

Upvotes: 1
