user19495470

PySpark remove double brackets after collect_set of list

Each value in codes is already a bracketed string, so collect_set produces what looks like double brackets. How do I remove them?

Input data :

from pyspark.sql import functions as F

DF = [('1', '[132]'),
      ('1', '[184, 88]'),
      ('2', '[55]'),
      ('2', '[123,33]')]

DF = spark.sparkContext.parallelize(DF).toDF(['id', 'codes'])

DF.groupBy("id").agg(F.collect_set("codes").alias("codes_concat")).show(4)
+---+------------------+
| id|      codes_concat|
+---+------------------+
|  1|[[184, 88], [132]]|
|  2|  [[123,33], [55]]|
+---+------------------+

How do I get a simple list instead:

+---+--------------+
| id|  codes_concat|
+---+--------------+
|  1|[184, 88, 132]|
|  2|  [123,33, 55]|
+---+--------------+

Upvotes: 1

Views: 721

Answers (2)

wwnde

Reputation: 26676

Another way

from pyspark.sql import functions as F

new = (DF.withColumn('codes', F.regexp_replace('codes', r'\[|\]', ''))  # strip the brackets from each string
       .groupBy("id").agg(F.collect_set("codes").alias("codes_concat")))  # then aggregate
new.show(4)
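Note that the array elements are still strings here, so '184, 88' stays a single element containing a comma. If a genuinely flat array of integers is wanted, one option (a sketch, not part of the original answer; it assumes Spark 2.4+ for flatten and array_distinct) is to split each cleaned string before collecting:

from pyspark.sql import functions as F

flat = (DF
        .withColumn('codes', F.regexp_replace('codes', r'\[|\]', ''))  # strip brackets
        .withColumn('codes', F.split('codes', r',\s*'))                # '184, 88' -> ['184', '88']
        .withColumn('codes', F.col('codes').cast('array<int>'))        # cast string elements to ints
        .groupBy('id')
        .agg(F.array_distinct(F.flatten(F.collect_list('codes'))).alias('codes_concat')))
flat.show(4)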

Upvotes: 0

过过招

Reputation: 4244

You can use the translate function to remove the [ and ] first, and then use the collect_set function to aggregate.

DF.groupBy("id").agg(F.collect_set(F.translate("codes", "[]", "")).alias("codes_concat")).show(4)
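With the sample data this should produce the requested flat list (the elements are still strings, and collect_set does not guarantee element order):

+---+--------------+
| id|  codes_concat|
+---+--------------+
|  1|[184, 88, 132]|
|  2|  [123,33, 55]|
+---+--------------+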

Upvotes: 1
