Wael Othmani

Reputation: 47

Aggregate over a column of strings in PySpark

I have the following Spark dataframe:

column1|column2
A|"1"
A|"1"
A|"2"
B|"1"

and I would like to get the count of each distinct value grouped by column1. The expected output would be something like this:

column1|column2
A|"1:2,2:1"
B|"1:1"

Any help here?

Upvotes: 2

Views: 1818

Answers (2)

blackbishop

Reputation: 32670

Group by column1 and column2 to count the distinct values, then group again by column1 and collect a list of column2:count pairs. Something like this:

data = [("A", "1"), ("A", "1"),
        ("A", "2"), ("B", "1")]
df = spark.createDataFrame(data, ["column1", "column2"])

df.groupBy("column1", "column2").agg(count("*").alias("ct")) \
    .groupBy("column1") \
    .agg(collect_list(concat(col("column2"), lit(":"), col("ct"))).alias("result")) \
    .drop("column2", "ct")\
    .show() 

Gives:

+-------+----------+
|column1|    result|
+-------+----------+
|      B|     [1:1]|
|      A|[1:2, 2:1]|
+-------+----------+
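If you need the exact comma-joined string from the question ("1:2,2:1") rather than an array, concat_ws can join the collected pairs; a minimal sketch building on the same df:

from pyspark.sql.functions import col, collect_list, concat, concat_ws, count, lit

# Same aggregation as above, but join the collected pairs into one
# comma-separated string so the result matches the asked-for format.
df.groupBy("column1", "column2").agg(count("*").alias("ct")) \
    .groupBy("column1") \
    .agg(concat_ws(",", collect_list(concat(col("column2"), lit(":"), col("ct")))).alias("column2")) \
    .show()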

Upvotes: 0

Daniel Argüelles

Reputation: 2349

The simplest way is to group by column1 and column2:

df2 = df.groupBy(df.column1, df.column2).count()

so you'll get something like:

column1 | column2 | count 
A       | "1"     | 2 
A       | "2"     | 1 
B       | "1"     | 1

This is the easiest form of the data to work with. If you want the exact output from the question, you can now concatenate column2 with the count, group again by column1, and join the elements of each group, as sketched below.
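A minimal sketch of that second step, building on df2 from above (concat, concat_ws, and collect_list are standard pyspark.sql.functions; the "value:count" pair format follows the question):

from pyspark.sql.functions import col, collect_list, concat, concat_ws, lit

# Build "value:count" pairs, then gather and join them per column1 group.
result = df2 \
    .withColumn("pair", concat(col("column2"), lit(":"), col("count"))) \
    .groupBy("column1") \
    .agg(concat_ws(",", collect_list("pair")).alias("column2"))

result.show()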

Upvotes: 1
