Reputation: 165
I have a table represented by a Spark Dataset<Row>:
origin.show();
+------+
|Origin|
+------+
| USA|
| Japan|
| USA|
| USA|
| Japan|
|Europe|
+------+
I want to add a "countByValue" column to get a table like this:
+------+-----+
|Origin|Count|
+------+-----+
|Europe| 1|
| USA| 3|
| USA| 3|
| USA| 3|
| Japan| 2|
| Japan| 2|
+------+-----+
I found a solution, but it seems very inefficient. I group the origin dataset and use the count function:
Dataset<Row> grouped = origin.groupBy(originCol).agg(functions.count(originCol).alias("Count"));
grouped.show();
+------+-----+
|Origin|Count|
+------+-----+
|Europe| 1|
| USA| 3|
| Japan| 2|
+------+-----+
Then I just join the result with the origin dataset:
Dataset<Row> finalDs = origin.join(grouped, originCol);
Is there a more efficient way to perform such an operation?
Upvotes: 2
Views: 1321
Reputation: 16086
You can write a query with a window function:
origin.withColumn("cnt", count('Origin).over(Window.partitionBy('Origin)))
Remember to import org.apache.spark.sql.functions._,
org.apache.spark.sql.expressions.Window, and spark.implicits._ (the last one enables the 'Origin symbol syntax).
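Since the question uses the Java API, here is a rough equivalent in Java (an untested sketch; it assumes the question's origin dataset):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;

// Count rows per Origin partition; no separate aggregation + join needed
Dataset<Row> withCount = origin.withColumn("Count",
    count(col("Origin")).over(Window.partitionBy(col("Origin"))));

This produces one row per input row with the per-value count attached, matching the table in the question.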
Upvotes: 4
Reputation: 23109
This is what you need to do. First the imports (spark.implicits._ is needed for toDF and the $ syntax):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  "USA",
  "Japan",
  "USA",
  "USA",
  "Japan",
  "Europe"
).toDF("origin")

val result = df.groupBy("origin")
  .agg(collect_list($"origin").alias("origin1"),
       count("origin").alias("count"))
  .withColumn("origin", explode($"origin1"))
  .drop("origin1") // drop the collected list, not the exploded column
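And since the question is in Java, a hedged sketch of the same collect_list/explode idea with the Java API (untested; variable and column names are illustrative):

import static org.apache.spark.sql.functions.*;

Dataset<Row> result = origin
    .groupBy(col("Origin"))
    .agg(collect_list(col("Origin")).alias("origins"),
         count(col("Origin")).alias("Count"))
    .withColumn("Origin", explode(col("origins"))) // one output row per original row
    .drop("origins");

Note that collect_list gathers every value of a group into a single in-memory array, so the window-function answer is usually the simpler choice.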
Upvotes: 1