Reputation: 840
I have a large PySpark DataFrame (23M rows) with the following format:
names, sentiment
["Lily","Kerry","Mona"], 10
["Kerry", "Mona"], 2
["Mona"], 0
I would like to compute the average sentiment for each unique name in the names column, resulting in:
name, sentiment
"Lily", 10
"Kerry", 6
"Mona", 4
Upvotes: 1
Views: 102
Reputation: 5526
Simply explode the array and then group by name. In PySpark:
import pyspark.sql.functions as f

# one output row per (name, sentiment) pair
df1 = df.select(f.explode('names').alias('name'), 'sentiment')
# average sentiment per unique name
df1.groupBy('name').agg(f.avg('sentiment').alias('sentiment')).show()
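With the sample data from the question, this should print (row order may vary):
+-----+---------+
| name|sentiment|
+-----+---------+
| Lily|     10.0|
|Kerry|      6.0|
| Mona|      4.0|
+-----+---------+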
Upvotes: 1
Reputation: 1214
import org.apache.spark.sql.functions.{avg, col, explode}
import spark.implicits._ // for toDF and the 'names column syntax (spark is the SparkSession)

val avgDF = Seq(
  (Seq("Lily", "Kerry", "Mona"), 10),
  (Seq("Kerry", "Mona"), 2),
  (Seq("Mona"), 0)
).toDF("names", "sentiment")
// one row per (name, sentiment) pair
val avgDF1 = avgDF.withColumn("name", explode('names))
// average sentiment per unique name
val avgResultDF = avgDF1.groupBy("name").agg(avg(col("sentiment")))
avgResultDF.show(false)
// +-----+--------------+
// |name |avg(sentiment)|
// +-----+--------------+
// |Lily |10.0 |
// |Kerry|6.0 |
// |Mona |4.0 |
// +-----+--------------+
Upvotes: 1