Reputation: 47
I have the following Spark dataframe:
column1|column2
A|"1"
A|"1"
A|"2"
B|"1"
and I would like to get the count of each distinct column2 value within each column1 group. The expected output would be something like this:
column1|column2
A|"1:2,2:1"
B|"1:1"
Any help here?
Upvotes: 2
Views: 1818
Reputation: 32670
Use groupBy on column1 and column2 to count the distinct values, then groupBy again on column1 and collect a list of column2:count pairs. Something like this:
data = [("A", "1"), ("A", "1"),
("A", "2"), ("B", "1")]
df = spark.createDataFrame(data, ["column1", "column2"])
df.groupBy("column1", "column2").agg(count("*").alias("ct")) \
.groupBy("column1") \
.agg(collect_list(concat(col("column2"), lit(":"), col("ct"))).alias("result")) \
.drop("column2", "ct")\
.show()
Gives:
+-------+----------+
|column1| result|
+-------+----------+
| B| [1:1]|
| A|[1:2, 2:1]|
+-------+----------+
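Note that collect_list returns an array; if you need the single comma-separated string from the question, one option (a sketch on top of the pipeline above, not part of the original answer) is to wrap the list in concat_ws:

from pyspark.sql.functions import col, collect_list, concat, concat_ws, count, lit

# Same aggregation, but join the "column2:count" pairs into one string.
# The order of elements inside collect_list is not guaranteed.
df.groupBy("column1", "column2").agg(count("*").alias("ct")) \
    .groupBy("column1") \
    .agg(concat_ws(",", collect_list(concat(col("column2"), lit(":"), col("ct")))).alias("result")) \
    .show()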
Upvotes: 0
Reputation: 2349
The simplest way is to group by column1 and column2:
df2 = df.groupBy(df.column1, df.column2).count()
so you'll get something like:
column1 | column2 | count
A | "1" | 2
A | "2" | 1
B | "1" | 1
This is the easiest form of the dataset to work with. If you want exactly the output you described, you can then concatenate column2 and count, group again by column1, and concatenate the elements of each group, as sketched below.
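A minimal sketch of that second step, assuming the df2 from above (the pair column name and the use of concat_ws are my choices, not part of the answer):

from pyspark.sql.functions import col, collect_list, concat, concat_ws, lit

# Build "column2:count" strings, then join them with commas per column1 group.
# The order of items within each group is not guaranteed.
result = df2.withColumn("pair", concat(col("column2"), lit(":"), col("count"))) \
    .groupBy("column1") \
    .agg(concat_ws(",", collect_list("pair")).alias("column2"))
result.show()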
Upvotes: 1