Count the number of unique values by group

Question

I know I can use combine(groupby(df, :A), nrow => :count) to count the number of rows for each value of :A. What is the correct way to get the number of unique values of :B for each value of :A? Basically, I am looking for the Julia counterpart of R's df %>% group_by(A) %>% summarize(n_unique = n_distinct(B)). Thanks!

Nils Gudat · Accepted Answer

I think you should be able to do

combine(groupby(df, :A), :B => length ∘ unique => :n_distinct_B)

like this:

julia> using DataFrames

julia> df = DataFrame(a = rand(["a", "b"], 20), b = rand(1:5, 20))
20×2 DataFrame
 Row │ a       b     
     │ String  Int64 
─────┼───────────────
   1 │ a           3
   2 │ b           4
   3 │ a           1
   4 │ a           1
   5 │ b           1
   6 │ a           2
   7 │ b           4
   8 │ a           2
   9 │ b           2
  10 │ b           1
  11 │ b           3
  12 │ b           3
  13 │ a           4
  14 │ a           4
  15 │ b           3
  16 │ b           2
  17 │ a           5
  18 │ a           5
  19 │ b           5
  20 │ a           1

julia> combine(groupby(df, :a), :b => length ∘ unique => :n_distinct_b)
2×2 DataFrame
 Row │ a       n_distinct_b 
     │ String  Int64        
─────┼──────────────────────
   1 │ a                  5
   2 │ b                  5
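
Since the table above is random, here is a minimal reproducible sketch of the same pattern on a small fixed table. The data and the column names count and n_distinct_B are made up for illustration, and the missing-handling variant is an addition that is not part of the original answer. Note that unique treats missing as a value of its own, so it is counted unless it is skipped first.

```julia
using DataFrames

# Small fixed table (hypothetical data) so the result is reproducible.
df = DataFrame(A = ["x", "x", "x", "y", "y"],
               B = [1, 1, 2, 3, missing])

# Row count and distinct count per group in a single combine call.
# `length ∘ unique` receives the whole :B column of each group.
res = combine(groupby(df, :A),
              nrow => :count,
              :B => (length ∘ unique) => :n_distinct_B)

# Variant that drops missing values before counting distinct ones.
res_nomissing = combine(groupby(df, :A),
                        :B => (b -> length(unique(skipmissing(b)))) => :n_distinct_B)
```

With this data, group "x" has 3 rows and 2 distinct values of :B. Group "y" has 2 rows and 2 distinct values when missing is counted, or 1 when it is skipped.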
