Concatenating lists in PySpark

Question

In my Spark Dataframe, one of the columns is of strings

Activities
"1 1 1 1 0 0 0 0 0"
"0 0 0 1 1 1 0 0 0"
"1 1 1 1 0 0 0 0 0"
"0 0 0 1 1 1 0 0 0"
"1 1 1 1 0 0 0 0 0"
"0 0 0 1 1 1 0 0 0"

I wish to collect strings from each row of this column and make a single list by concatenation. Then, split this huge string to make a huge single integer array like

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...]

(Of course, one can split the strings first, into lists, then append all the lists to form a big list, but the issue of How to concatenation RDD based lists remains)

Using pythons local data structures I can do:

import pyspark.sql.functions as F

allActivities = []
activitiesListColumn = df.agg(F.collect_list("Activities").alias("Activities")).collect()[0]
for rowActivity in activitiesListColumn["Activities"]:
    activities = rowActivity.split()
    allActivities += activities
print(allActivities)

How to get this done with RDD based (ie parallel-ized) data structures?

Concatenating lists in PySpark

Answers (1)

Related Questions