Reputation: 841
This is my basic Dataframe:
root |-- user_id: string (nullable = true)
|-- review_id: string (nullable = true)
|-- review_influence: double (nullable = false)
The goal is to have the sum of review_influence for each user_id. So I tried to aggregate the data and sum it up like this:
val review_influence_listDF = review_with_influenceDF
.groupBy("user_id")
.agg(collect_list("review_id") as("list_review_id"), collect_list("review_influence") as ("list_review_influence"))
.agg(sum($"list_review_influence"))
But I have this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'sum(`list_review_influence`)' due to data type mismatch: function sum requires numeric types, not ArrayType(DoubleType,true);;
What can I do about it?
Upvotes: 0
Views: 1553
Reputation: 214927
You can directly sum the column in the agg
function:
review_with_influenceDF
.groupBy("user_id")
.agg(collect_list($"review_id").as("list_review_id"),
sum($"review_influence").as("sum_review_influence"))
Upvotes: 1