Reputation: 77
I am very new to PySpark; I'd appreciate your help. I have a DataFrame built like this:
test["1"]={"vars":["x1","x2"]}
test["2"]={"vars":["x2"]}
test["3"]={"vars":["x3"]}
test["4"]={"vars":["x2","x3"]}
pdDF = pd.DataFrame(test).transpose()
sparkDF=spark.createDataFrame(pdDF)
+--------+
| vars|
+--------+
|[x1, x2]|
| [x2]|
| [x3]|
|[x2, x3]|
+--------+
I am looking for a way to group column "vars" by the individual values inside each list and count them. The result I am looking for is:
+-----+---+
|count|var|
+-----+---+
| 1| x1|
| 3| x2|
| 2| x3|
+-----+---+
Can somebody advise me on how to achieve this?
Thanks in advance!
Upvotes: 1
Views: 63
Reputation: 1698
You can explode the array column so that each element becomes its own row, then group by the exploded values and count:

from pyspark.sql.functions import explode

# Recreate the sample data: a single array-typed column named "vars"
values = [(["x1", "x2"],), (["x2"],), (["x3"],), (["x2", "x3"],)]
df = sqlContext.createDataFrame(values, ['vars'])
df.show()
+--------+
| vars|
+--------+
|[x1, x2]|
| [x2]|
| [x3]|
|[x2, x3]|
+--------+
# explode produces one row per array element; groupBy then counts occurrences
newdf = df.withColumn("vars2", explode(df.vars))
newdf.groupBy('vars2').count().show()
+-----+-----+
|vars2|count|
+-----+-----+
| x2| 3|
| x3| 2|
| x1| 1|
+-----+-----+
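If you need the exact column names and layout from the question (count first, then a column called var), a minimal variation of the same approach, assuming the df defined above, is to alias the exploded column and reselect after the aggregation:

from pyspark.sql.functions import explode

# Alias the exploded column as "var", then put "count" first to match
# the layout asked for in the question
result = (df.select(explode(df.vars).alias("var"))
            .groupBy("var")
            .count()
            .select("count", "var")
            .orderBy("var"))
result.show()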
Upvotes: 2