Reputation: 47
I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the pyspark REPL:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
df3.agg(sum(col('cr'))).first()[0]
PySpark can't calculate that sum. However, df3.rdd.reduce(lambda x, y: x[2] + y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?
Upvotes: 0
Views: 1108
Reputation: 13936
You should import PySpark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which just sums a plain sequence of numbers.
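For example, here is a minimal sketch of the fix, assuming the df3 built in the question. Importing the functions module under an alias (F is just a common convention) avoids shadowing the built-in sum:

from pyspark.sql import functions as F

# F.sum builds a Column aggregate expression, unlike Python's built-in sum
total_cr = df3.agg(F.sum(F.col('cr'))).first()[0]
# With the data above this should be roughly 0.5 + 0.6667 = 1.1667

Using the module alias also keeps Python's built-ins sum, min, and max usable elsewhere, which is generally safer than from pyspark.sql.functions import sum.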
Upvotes: 1