Reputation: 4482
I have the following PySpark DataFrame:
import pandas as pd
foo = pd.DataFrame({'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'time': [1, 2, 3, 4, 1, 2, 3, 5],
                    'col': ['1', '2', '1', '2', '3', '2', '3', '2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
| a| 1| 1|
| a| 2| 2|
| a| 3| 1|
| a| 4| 2|
| b| 1| 3|
| b| 2| 2|
| b| 3| 3|
| b| 5| 2|
+---+----+---+
I would like to have one row for each id and a column containing a list of the values from the col column. The output would look like this:
+---+------------------+
| id| col|
+---+------------------+
| a| ['1','2','1','2']|
| b| ['3','2','3','2']|
+---+------------------+
Upvotes: 0
Views: 105
Reputation: 964
You can do that with a groupBy on the id column followed by a collect_list on the col column:
import pyspark.sql.functions as F
list_df = foo_df.groupBy(F.col("id")).agg(F.collect_list(F.col("col")).alias("col"))
list_df.show()
Output:
+---+------------+
| id| col|
+---+------------+
| a|[1, 2, 1, 2]|
| b|[3, 2, 3, 2]|
+---+------------+
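Note that collect_list after a groupBy does not guarantee any particular element order, since it depends on how the rows happen to be arranged across partitions. If the list should follow the time column, one option (a sketch, assuming time defines the desired order) is to collect (time, col) structs, sort the array, and then keep only the col field:
from pyspark.sql import functions as F
# Collect (time, col) pairs per id; sort_array orders the structs
# by their fields left to right, i.e. by time first, then extract
# the col field from each struct, which yields an array of strings.
ordered_df = (
    foo_df.groupBy("id")
          .agg(F.sort_array(F.collect_list(F.struct("time", "col"))).alias("pairs"))
          .select("id", F.col("pairs.col").alias("col"))
)
ordered_df.show()
For your sample data the result happens to be the same either way, because the rows already arrive in time order, but the sorted version stays correct after shuffles or repartitioning.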
Upvotes: 2