Reputation: 1061
I have the following dataset:
col1_id  col2_id  type
1        t1_1     t1
1        t1_2     t1
2        t2_2     t2
col1_id and col2_id have a one-to-many relationship, i.e. multiple col2_id values can share the same col1_id value. type (e.g. t1) is derived from col2_id.
The objective is to find the number of col1_id values having a given type (i.e. t1, t2, etc.).
Here is what I'm doing currently:
// One row per (col1_id, type) pair, then count the pairs per type
val df1 = df.select($"col1_id", $"type").groupBy($"col1_id", $"type").count()
df1.drop($"count").groupBy($"type").count().show()
This works fine; however, I'm wondering if there might be a better way to accomplish this. Please let me know.
Upvotes: 0
Views: 133
Reputation: 27373
Not sure why you mention col2_id; it does not play a role here. I expect what you want to do is to count the distinct col1_id values per type? If yes, then do:
import org.apache.spark.sql.functions.countDistinct

df
  .groupBy($"type")
  .agg(
    countDistinct($"col1_id")
  )
  .show()
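For reference, here is a minimal, self-contained sketch that reproduces this end to end. It assumes a local SparkSession (the app name, master setting, and the col1_id_count alias are just illustrative choices, not anything from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder()
  .master("local[*]")            // assumption: run locally for the demo
  .appName("count-distinct-demo")
  .getOrCreate()
import spark.implicits._

// Sample rows taken from the question
val df = Seq(
  (1, "t1_1", "t1"),
  (1, "t1_2", "t1"),
  (2, "t2_2", "t2")
).toDF("col1_id", "col2_id", "type")

// Number of distinct col1_id values per type
df.groupBy($"type")
  .agg(countDistinct($"col1_id").as("col1_id_count"))
  .show()

// Expected counts for the sample data (row order may vary):
// +----+-------------+
// |type|col1_id_count|
// +----+-------------+
// |  t1|            1|
// |  t2|            1|
// +----+-------------+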
Upvotes: 1