Reputation: 4482
I have the following PySpark DataFrame:
import pandas as pd
foo = pd.DataFrame({'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'time': [1, 2, 3, 4, 1, 2, 3, 5],
                    'col': ['1', '2', '1', '2', '3', '2', '3', '2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
| a| 1| 1|
| a| 2| 2|
| a| 3| 1|
| a| 4| 2|
| b| 1| 3|
| b| 2| 2|
| b| 3| 3|
| b| 5| 2|
+---+----+---+
I would like to have one row for each id and a column containing a list of the values from the col column. The output would look like this:
+---+------------------+
| id| col|
+---+------------------+
| a| ['1','2','1','2']|
| b| ['3','2','3','2']|
+---+------------------+
Upvotes: 0
Views: 105
Reputation: 964
You can do that with a groupBy on the id column followed by a collect_list on the col column:
import pyspark.sql.functions as F
list_df = foo_df.groupBy(F.col("id")).agg(F.collect_list(F.col("col")).alias("col"))
list_df.show()
Output:
+---+------------+
| id| col|
+---+------------+
| a|[1, 2, 1, 2]|
| b|[3, 2, 3, 2]|
+---+------------+
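Note that collect_list after a groupBy does not guarantee any particular element order, since it depends on how the rows happen to be arranged across partitions. If the list should follow the time column, one option (a sketch, assuming time defines the desired order) is to collect (time, col) structs, sort the array, and then keep only the col field:
from pyspark.sql import functions as F
# Collect (time, col) pairs per id; sort_array orders the structs
# by their fields left to right, i.e. by time first, then extract
# the col field from each struct, which yields an array of strings.
ordered_df = (
    foo_df.groupBy("id")
          .agg(F.sort_array(F.collect_list(F.struct("time", "col"))).alias("pairs"))
          .select("id", F.col("pairs.col").alias("col"))
)
ordered_df.show()
For your sample data the result happens to be the same either way, because the rows already arrive in time order, but the sorted version stays correct after shuffles or repartitioning.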
Upvotes: 2