user19495470

PySpark remove double brackets after collect_set of list

Each value in codes is already a bracketed string, so collect_set produces what looks like double brackets. How do I remove them?

Input data :

from pyspark.sql import functions as F

DF = [('1', '[132]'),
      ('1', '[184, 88]'),
      ('2', '[55]'),
      ('2', '[123,33]')]

DF = spark.sparkContext.parallelize(DF).toDF(['id', 'codes'])

DF.groupBy("id").agg(F.collect_set("codes").alias("codes_concat")).show(4)
+---+------------------+
| id|      codes_concat|
+---+------------------+
|  1|[[184, 88], [132]]|
|  2|  [[123,33], [55]]|
+---+------------------+

How do I get a simple list instead:

+---+--------------+
| id|  codes_concat|
+---+--------------+
|  1|[184, 88, 132]|
|  2|  [123,33, 55]|
+---+--------------+

Upvotes: 1

Views: 721

Answers (2)

wwnde

Reputation: 26676

Another way

from pyspark.sql import functions as F

new = (DF.withColumn('codes', F.regexp_replace('codes', r'\[|\]', ''))  # strip the brackets from each string
       .groupBy("id").agg(F.collect_set("codes").alias("codes_concat")))  # then aggregate
new.show(4)
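Note that the array elements are still strings here, so '184, 88' stays a single element containing a comma. If a genuinely flat array of integers is wanted, one option (a sketch, not part of the original answer; it assumes Spark 2.4+ for flatten and array_distinct) is to split each cleaned string before collecting:

from pyspark.sql import functions as F

flat = (DF
        .withColumn('codes', F.regexp_replace('codes', r'\[|\]', ''))  # strip brackets
        .withColumn('codes', F.split('codes', r',\s*'))                # '184, 88' -> ['184', '88']
        .withColumn('codes', F.col('codes').cast('array<int>'))        # cast string elements to ints
        .groupBy('id')
        .agg(F.array_distinct(F.flatten(F.collect_list('codes'))).alias('codes_concat')))
flat.show(4)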

Upvotes: 0

过过招

Reputation: 4244

You can use the translate function to remove the [ and ] first, and then use the collect_set function to aggregate.

DF.groupBy("id").agg(F.collect_set(F.translate("codes", "[]", "")).alias("codes_concat")).show(4)
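With the sample data this should produce the requested flat list (the elements are still strings, and collect_set does not guarantee element order):

+---+--------------+
| id|  codes_concat|
+---+--------------+
|  1|[184, 88, 132]|
|  2|  [123,33, 55]|
+---+--------------+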

Upvotes: 1
