Wael Othmani

Reputation: 47

Aggregate over a column of strings in PySpark

I have the following Spark dataframe:

column1|column2
A|"1"
A|"1"
A|"2"
B|"1"

and I would like to get the count of each distinct value grouped by column1. The expected output would be something like this:

column1|column2
A|"1:2,2:1"
B|"1:1"

Any help here?

Upvotes: 2

Views: 1818

Answers (2)

blackbishop

Reputation: 32670

Group by column1 and column2 to count the distinct values, then group again by column1 and collect a list of column2:count pairs. Something like this:

data = [("A", "1"), ("A", "1"),
        ("A", "2"), ("B", "1")]
df = spark.createDataFrame(data, ["column1", "column2"])

df.groupBy("column1", "column2").agg(count("*").alias("ct")) \
    .groupBy("column1") \
    .agg(collect_list(concat(col("column2"), lit(":"), col("ct"))).alias("result")) \
    .drop("column2", "ct")\
    .show() 

Gives:

+-------+----------+
|column1|    result|
+-------+----------+
|      B|     [1:1]|
|      A|[1:2, 2:1]|
+-------+----------+
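If you need the exact comma-joined string from the question ("1:2,2:1") rather than an array, concat_ws can join the collected pairs; a minimal sketch building on the same df:

from pyspark.sql.functions import col, collect_list, concat, concat_ws, count, lit

# Same aggregation as above, but join the collected pairs into one
# comma-separated string so the result matches the asked-for format.
df.groupBy("column1", "column2").agg(count("*").alias("ct")) \
    .groupBy("column1") \
    .agg(concat_ws(",", collect_list(concat(col("column2"), lit(":"), col("ct")))).alias("column2")) \
    .show()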

Upvotes: 0

Daniel Argüelles

Reputation: 2349

The simplest way is to group by column1 and column2:

df2 = df.groupBy(df.column1, df.column2).count()

so you'll get something like:

column1 | column2 | count 
A       | "1"     | 2 
A       | "2"     | 1 
B       | "1"     | 1

This is the easiest form of the data to work with. If you want the exact output from the question, you can now concatenate column2 with the count, group again by column1, and join the elements of each group, as sketched below.
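A minimal sketch of that second step, building on df2 from above (concat, concat_ws, and collect_list are standard pyspark.sql.functions; the "value:count" pair format follows the question):

from pyspark.sql.functions import col, collect_list, concat, concat_ws, lit

# Build "value:count" pairs, then gather and join them per column1 group.
result = df2 \
    .withColumn("pair", concat(col("column2"), lit(":"), col("count"))) \
    .groupBy("column1") \
    .agg(concat_ws(",", collect_list("pair")).alias("column2"))

result.show()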

Upvotes: 1
