Reputation: 1061
I have the following dataset:
col1_id  col2_id  type
1        t1_1     t1
1        t1_2     t1
2        t2_2     t2
col1_id and col2_id have a one-to-many relationship, i.e. multiple col2_id values can share the same col1_id value. type (e.g. t1) is derived from col2_id.
The objective is to find the number of col1_id values having a given type (i.e. t1, t2, etc.).
Here is what I'm doing currently:
// One row per (col1_id, type) pair, then count the pairs per type
val df1 = df.select($"col1_id", $"type").groupBy($"col1_id", $"type").count()
df1.drop($"count").groupBy($"type").count().show()
This works fine; however, I'm wondering if there might be a better way to accomplish this. Please let me know.
Upvotes: 0
Views: 133
Reputation: 27373
Not sure why you mention col2_id; it does not play a role here. I expect what you want to do is to count the distinct col1_id values per type? If yes, then do:
import org.apache.spark.sql.functions.countDistinct

df
  .groupBy($"type")
  .agg(
    countDistinct($"col1_id")
  )
  .show()
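For reference, here is a minimal, self-contained sketch that reproduces this end to end. It assumes a local SparkSession (the app name, master setting, and the col1_id_count alias are just illustrative choices, not anything from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder()
  .master("local[*]")            // assumption: run locally for the demo
  .appName("count-distinct-demo")
  .getOrCreate()
import spark.implicits._

// Sample rows taken from the question
val df = Seq(
  (1, "t1_1", "t1"),
  (1, "t1_2", "t1"),
  (2, "t2_2", "t2")
).toDF("col1_id", "col2_id", "type")

// Number of distinct col1_id values per type
df.groupBy($"type")
  .agg(countDistinct($"col1_id").as("col1_id_count"))
  .show()

// Expected counts for the sample data (row order may vary):
// +----+-------------+
// |type|col1_id_count|
// +----+-------------+
// |  t1|            1|
// |  t2|            1|
// +----+-------------+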
Upvotes: 1