Reputation: 840
I have a large PySpark DataFrame (23M rows) with the following format:
names, sentiment
["Lily","Kerry","Mona"], 10
["Kerry", "Mona"], 2
["Mona"], 0
I would like to compute the average sentiment for each unique name in the names column, resulting in:
name, sentiment
"Lily", 10
"Kerry", 6
"Mona", 4
Upvotes: 1
Views: 102
Reputation: 5526
Simply explode the array and then group by name. In PySpark:
import pyspark.sql.functions as f

# one output row per (name, sentiment) pair
df1 = df.select(f.explode('names').alias('name'), 'sentiment')
# average sentiment per unique name
df1.groupBy('name').agg(f.avg('sentiment').alias('sentiment')).show()
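With the sample data from the question, this should print (row order may vary):
+-----+---------+
| name|sentiment|
+-----+---------+
| Lily|     10.0|
|Kerry|      6.0|
| Mona|      4.0|
+-----+---------+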
Upvotes: 1
Reputation: 1214
import org.apache.spark.sql.functions.{avg, col, explode}
import spark.implicits._ // for toDF and the 'names column syntax (spark is the SparkSession)

val avgDF = Seq(
  (Seq("Lily", "Kerry", "Mona"), 10),
  (Seq("Kerry", "Mona"), 2),
  (Seq("Mona"), 0)
).toDF("names", "sentiment")
// one row per (name, sentiment) pair
val avgDF1 = avgDF.withColumn("name", explode('names))
// average sentiment per unique name
val avgResultDF = avgDF1.groupBy("name").agg(avg(col("sentiment")))
avgResultDF.show(false)
// +-----+--------------+
// |name |avg(sentiment)|
// +-----+--------------+
// |Lily |10.0 |
// |Kerry|6.0 |
// |Mona |4.0 |
// +-----+--------------+
Upvotes: 1