Reputation: 5670
How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df, I get an error.
# Create SparkDataFrame
df <- createDataFrame(faithful)
# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
## This works
# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
## This doesn't work
Either SparkR or PySpark code will work.
Upvotes: 15
Views: 97283
Reputation: 114
The accepted answer was helpful for me, but I found the approach below simpler, and it doesn't rely on an external API.
from pyspark.sql.functions import col, lit

# Start from a zero column, then add each column in turn
sum_df = df.withColumn('total', lit(0))
for c in col_list:
    sum_df = sum_df.withColumn('total', col('total') + col(c))
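A minimal end-to-end sketch of this loop, assuming a running SparkSession named spark and hypothetical columns a and b:
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([(1, 10), (2, 20)], ["a", "b"])  # hypothetical data
col_list = ["a", "b"]

sum_df = df.withColumn("total", lit(0))
for c in col_list:
    sum_df = sum_df.withColumn("total", col("total") + col(c))

sum_df.show()  # totals: 11 and 21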
Upvotes: 1
Reputation: 41
You can use expr():
import pyspark.sql.functions as f
numeric_cols = ['col_a','col_b','col_c']
df = df.withColumn('total', f.expr('+'.join(numeric_cols)))
PySpark's expr() parses a SQL-like expression string and returns it as a Column.
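For instance, a minimal sketch with hypothetical data, assuming a SparkSession named spark:
import pyspark.sql.functions as f

df = spark.createDataFrame([(1, 2, 3)], ["col_a", "col_b", "col_c"])  # hypothetical data
numeric_cols = ["col_a", "col_b", "col_c"]
# '+'.join(numeric_cols) builds the SQL string 'col_a+col_b+col_c'
df.withColumn("total", f.expr("+".join(numeric_cols))).show()  # total = 6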
Upvotes: 4
Reputation: 1516
For PySpark, if you don't want to explicitly type out the columns:
from operator import add
from functools import reduce
from pyspark.sql import functions as F

new_df = df.withColumn('total', reduce(add, [F.col(x) for x in numeric_col_list]))
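A self-contained sketch of this, assuming a SparkSession named spark and hypothetical columns x and y:
from operator import add
from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 2), (3, 4)], ["x", "y"])  # hypothetical data
numeric_col_list = ["x", "y"]
# Column overloads +, so reduce folds the list into a single sum expression
new_df = df.withColumn("total", reduce(add, [F.col(x) for x in numeric_col_list]))
new_df.show()  # totals: 3 and 7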
Upvotes: 61
Reputation: 66
SparkR code:
library(SparkR)
df <- createDataFrame(sqlContext, faithful)
w <- list(agg(df, sum(df$waiting)), agg(df, sum(df$eruptions)))
head(w[[1]])
head(w[[2]])
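For comparison, a rough PySpark equivalent of summing each column separately (a sketch assuming a SparkSession named spark and faithful-like data):
from pyspark.sql import functions as F

df = spark.createDataFrame([(3.6, 79.0), (1.8, 54.0)], ["eruptions", "waiting"])  # hypothetical sample
df.agg(F.sum("waiting"), F.sum("eruptions")).show()  # one row with both totals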
Upvotes: 3
Reputation: 5710
org.apache.spark.sql.functions.sum(Column e)
Aggregate function: returns the sum of all values in the expression.
As you can see, sum takes just one column as input, so sum(df$waiting, df$eruptions) won't work. Since you want to sum up the numeric fields, you can do sum(df("waiting") + df("eruptions")). If you want to sum up values for individual columns instead, you can do df.agg(sum(df$waiting), sum(df$eruptions)).show
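In PySpark, the same two options might look like this (a sketch assuming a SparkSession named spark and faithful-like columns):
from pyspark.sql import functions as F

df = spark.createDataFrame([(3.6, 79.0), (1.8, 54.0)], ["eruptions", "waiting"])  # hypothetical sample
# One combined total: pass a single summed expression to sum()
df.agg(F.sum(df.waiting + df.eruptions)).show()
# Separate totals for each column
df.agg(F.sum(df.waiting), F.sum(df.eruptions)).show()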
Upvotes: 5
Reputation: 11587
You can do something like the following in PySpark:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([("a",1,10), ("b",2,20), ("c",3,30), ("d",4,40)], ["col1", "col2", "col3"])
>>> df.groupBy("col1").agg(F.sum(df.col2+df.col3)).show()
+----+------------------+
|col1|sum((col2 + col3))|
+----+------------------+
| d| 44|
| c| 33|
| b| 22|
| a| 11|
+----+------------------+
Upvotes: 9