Oleg Zdanevich

Reputation: 77

PySpark - Group by Array column

I am very new to PySpark. I'd appreciate your help. I have a dataframe:

test["1"]={"vars":["x1","x2"]}
test["2"]={"vars":["x2"]}
test["3"]={"vars":["x3"]}
test["4"]={"vars":["x2","x3"]}
pdDF = pd.DataFrame(test).transpose()
sparkDF=spark.createDataFrame(pdDF) 

+--------+
|    vars|
+--------+
|[x1, x2]|
|    [x2]|
|    [x3]|
|[x2, x3]|
+--------+

I am looking for a way to group column "vars" by the values inside its lists and count them. I am looking for the following result:


+-----+---+
|count|var|
+-----+---+
|    1| x1|
|    3| x2|
|    2| x3|
+-----+---+

Can somebody advise how to achieve this?

Thanks in advance!

Upvotes: 1

Views: 63

Answers (1)

Prathik Kini

Reputation: 1698

from pyspark.sql.functions import explode

# Build a sample dataframe with an array column
values = [(["x1", "x2"],), (["x2"],), (["x3"],), (["x2", "x3"],)]
df = spark.createDataFrame(values, ['vars'])
df.show()

+--------+
|    vars|
+--------+
|[x1, x2]|
|    [x2]|
|    [x3]|
|[x2, x3]|
+--------+

# Explode the array so each element gets its own row, then count occurrences
newdf = df.withColumn("vars2", explode(df.vars))
newdf.groupBy('vars2').count().show()

+-----+-----+
|vars2|count|
+-----+-----+
|   x2|    3|
|   x3|    2|
|   x1|    1|
+-----+-----+
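If you want the columns named and ordered exactly as in the question (count first, then var, sorted by var), here is a small follow-up sketch, assuming the same df as above:

from pyspark.sql.functions import explode

# Name the exploded column "var", reorder the columns, and sort by var
result = (df.withColumn("var", explode(df.vars))
            .groupBy("var")
            .count()
            .select("count", "var")
            .orderBy("var"))
result.show()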

Upvotes: 2
