Alexander Mann

Reputation: 165

How to perform a count-by-value operation on a Spark Dataset without grouping values?

I have a table represented by a Spark Dataset<Row>:

origin.show();

+------+
|Origin|
+------+
|   USA|
| Japan|
|   USA|
|   USA|
| Japan|
|Europe|
+------+

I want to build an additional "countByValue" column to get a table like this:

+------+-----+
|Origin|Count|
+------+-----+
|Europe|    1|
|   USA|    3|
|   USA|    3|
|   USA|    3|
| Japan|    2|
| Japan|    2|
+------+-----+

I found a solution, but it seems very inefficient: I group the origin dataset and use the count function.

Dataset<Row> grouped = origin.groupBy(originCol).agg(functions.count(originCol).alias("Count"));
grouped.show();

+------+-----+
|Origin|Count|
+------+-----+
|Europe|    1|
|   USA|    3|
| Japan|    2|
+------+-----+

Then I just join the result table with the origin dataset:

Dataset<Row> finalDs = origin.join(grouped, originCol);
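
For reference, here is a self-contained sketch of this approach, assuming originCol holds the column name "Origin" (the class name and sample data below are hypothetical, mirroring the table above):

import static org.apache.spark.sql.functions.count;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountByValueExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CountByValueExample")
                .master("local[*]")
                .getOrCreate();

        // Sample data matching the question's table
        Dataset<Row> origin = spark
                .createDataset(Arrays.asList("USA", "Japan", "USA", "USA", "Japan", "Europe"),
                        Encoders.STRING())
                .toDF("Origin");

        String originCol = "Origin";

        // Step 1: group once to get per-value counts
        Dataset<Row> grouped = origin.groupBy(originCol).agg(count(originCol).alias("Count"));

        // Step 2: join the counts back onto every original row
        // (this join is the extra shuffle the question asks how to avoid)
        Dataset<Row> finalDs = origin.join(grouped, originCol);
        finalDs.show();

        spark.stop();
    }
}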

Is there any other, more efficient way to perform such an operation?

Upvotes: 2

Views: 1321

Answers (2)

T. Gawęda

Reputation: 16086

You can write the query with a window function:

origin.withColumn("cnt", count('Origin).over(Window.partitionBy('Origin)))

Remember to import org.apache.spark.sql.functions._ and org.apache.spark.sql.expressions.Window.
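
Since the question uses the Java API, a rough Java equivalent of the same window expression might look like this (a sketch; origin is the Dataset<Row> from the question):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;

// Count rows within each Origin partition in a single pass; no separate join needed
Dataset<Row> withCnt = origin.withColumn("cnt",
        count(col("Origin")).over(Window.partitionBy(col("Origin"))));
withCnt.show();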

Upvotes: 4

koiralo

Reputation: 23109

This is what you need to do: collect the values of each group together with their count, then explode the collected list back into one row per value, so no join is required.

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
    "USA",
    "Japan",
    "USA",
    "USA",
    "Japan",
    "Europe"
).toDF("origin")

val result = df.groupBy("origin")
  .agg(collect_list($"origin").alias("origin1"), count("origin").alias("count"))
  .withColumn("origin", explode($"origin1")) // one row per original value, with its count
  .drop("origin1")                           // drop the collected list, not the exploded column
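
With the sample data above, result again carries one row per original value together with its count (3 for USA, 2 for Japan, 1 for Europe). Note that collect_list materializes every group's values into a single array, which can be costly for very large groups.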

Upvotes: 1
