Reputation: 11557
I am trying to sum a field that contains an array:
a = sc.parallelize([("a", [1, 1, 1]),
                    ("a", [2, 2])])
a = a.toDF(["g", "arr_val"])
a.registerTempTable('a')

sql = """
select
    aggregate(arr_val, 0, (acc, x) -> acc + x) as sum
from a
"""
spark.sql(sql).show()
But I'm running into the following error:
An error occurred while calling o24.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve 'aggregate(a.`arr_val`, 0, lambdafunction((CAST(namedlambdavariable() AS BIGINT) + namedlambdavariable()), namedlambdavariable(), namedlambdavariable()), lambdafunction(namedlambdavariable(), namedlambdavariable()))' due to data type mismatch: argument 3 requires int type, however, 'lambdafunction((CAST(namedlambdavariable() AS BIGINT) + namedlambdavariable()), namedlambdavariable(), namedlambdavariable())' is of bigint type.; line 3 pos 0;
How can I get this to work?
Upvotes: 0
Views: 1040
Reputation: 11557
The literal 0 you pass as the start value makes the accumulator an int, while the array elements (created from Python integers) are bigint, so acc + x is promoted to bigint and no longer matches the accumulator's type. You need to cast both the start value and the elements inside the lambda, e.g. to a float:
a = sc.parallelize([("a", [1, 1, 1]),
                    ("a", [2, 2])])
a = a.toDF(["g", "arr_val"])
a.registerTempTable('a')

sql = """
select
    -- cast the start value and each element so all three arguments agree on type
    aggregate(arr_val, cast(0 as float), (acc, x) -> acc + cast(x as float)) as sum
from a
"""
spark.sql(sql).show()
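With the float cast this returns 3.0 and 4.0 for the two sample rows, since aggregate folds each row's own array. If you would rather keep integer results, casting only the start value to bigint (the type Spark infers for the Python list elements) should also satisfy the type check, and on Spark 3.1+ the same fold can be written with the DataFrame API. A rough sketch, not tested against your data:

from pyspark.sql import functions as F

# Alternative: keep integer results by making the accumulator bigint,
# so it matches the element type (sketch, assumes the same table `a` as above)
sql = """
select
    aggregate(arr_val, cast(0 as bigint), (acc, x) -> acc + x) as sum
from a
"""
spark.sql(sql).show()

# Spark 3.1+ DataFrame API equivalent of the float version
a.select(
    F.aggregate("arr_val", F.lit(0.0), lambda acc, x: acc + x.cast("float")).alias("sum")
).show()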
Upvotes: 4