sanjayr

Reputation: 1949

Pyspark: Create new column of the set of values in a groupby

I have a pyspark dataframe like this:

df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-12", ], "Mode": ['A', 'B', 'A', 'C', 'C', 'B']})

df = spark.createDataFrame(df)

+----------+----+
|      Date|Mode|
+----------+----+
|2020-05-10|   A|
|2020-05-10|   B|
|2020-05-10|   A|
|2020-05-11|   C|
|2020-05-11|   C|
|2020-05-12|   B|
+----------+----+

And I would like to groupby the Date and create a new column of the set of values in the Mode column like so:

df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-12", ], "Mode": ['A', 'B', 'A', 'C', 'C', 'B'], "set(Mode)": [['A', 'B'], ['A', 'B'], ['A', 'B'], ['C'], ['C'], ['B']]})

df = spark.createDataFrame(df)

+----------+----+---------+
|      Date|Mode|set(Mode)|
+----------+----+---------+
|2020-05-10|   A|   [A, B]|
|2020-05-10|   B|   [A, B]|
|2020-05-10|   A|   [A, B]|
|2020-05-11|   C|      [C]|
|2020-05-11|   C|      [C]|
|2020-05-12|   B|      [B]|
+----------+----+---------+

Upvotes: 1

Views: 1493

Answers (2)

anky

Reputation: 75130

You can try collect_set over a window:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df.withColumn("Set", F.collect_set('Mode')
                      .over(Window.partitionBy("Date"))).orderBy("Date").show()

+----------+----+------+
|      Date|Mode|   Set|
+----------+----+------+
|2020-05-10|   A|[B, A]|
|2020-05-10|   A|[B, A]|
|2020-05-10|   B|[B, A]|
|2020-05-11|   C|   [C]|
|2020-05-11|   C|   [C]|
|2020-05-12|   B|   [B]|
+----------+----+------+

If the exact original row order matters, tag each row with an increasing id first:

(df.withColumn("idx", F.monotonically_increasing_id())  # remember the input row order
   .withColumn("Set", F.collect_set('Mode').over(Window.partitionBy("Date")))
   .orderBy("idx").drop("idx")).show()

+----------+----+------+
|      Date|Mode|   Set|
+----------+----+------+
|2020-05-10|   A|[B, A]|
|2020-05-10|   B|[B, A]|
|2020-05-10|   A|[B, A]|
|2020-05-11|   C|   [C]|
|2020-05-11|   C|   [C]|
|2020-05-12|   B|   [B]|
+----------+----+------+
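
Note that collect_set gives no guarantee about the order of the elements inside the array (you can see [B, A] rather than [A, B] above). If a deterministic order inside the set is wanted, a minimal sketch, assuming Spark 2.4+ where array_sort is available, would be:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Sort the collected set so the array contents are stable across runs;
# collect_set itself returns elements in no guaranteed order.
(df.withColumn("Set", F.array_sort(F.collect_set("Mode")
                                    .over(Window.partitionBy("Date"))))
   .orderBy("Date")
   .show())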

Upvotes: 4

Neeraj Bhadani

Reputation: 3110

You can try the code below:

# Import Libraries
import pandas as pd
from pyspark.sql.functions import collect_set

# Create DataFrame
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-12", ], "Mode": ['A', 'B', 'A', 'C', 'C', 'B']})
df = spark.createDataFrame(df)

# Group by Date and collect the Mode values as a set using the collect_set function.
df1 = df.groupBy("Date").agg(collect_set("Mode"))

# Join the DataFrames to get desired result.
df2 = df.join(df1, "Date")

# Display DataFrame
df2.show()

Output

+----------+----+-----------------+
|      Date|Mode|collect_set(Mode)|
+----------+----+-----------------+
|2020-05-11|   C|              [C]|
|2020-05-11|   C|              [C]|
|2020-05-10|   A|           [B, A]|
|2020-05-10|   B|           [B, A]|
|2020-05-10|   A|           [B, A]|
|2020-05-12|   B|              [B]|
+----------+----+-----------------+
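
As a side note (a small variation, not shown in the answer above), the aggregation can be aliased so the column header matches the set(Mode) name from the question:

from pyspark.sql.functions import collect_set

# Alias the aggregated column so the joined result is named "set(Mode)"
df1 = df.groupBy("Date").agg(collect_set("Mode").alias("set(Mode)"))
df2 = df.join(df1, "Date")
df2.show()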

I hope this helps.

Upvotes: 3
