Reputation: 77
I am aware of combine(groupby(df, :A), nrow => :count) to count the number of rows for each value of :A. However, what is the correct implementation to get the number of unique values of :B for each value of :A? Basically, I am looking for the counterpart of the R code df %>% group_by(A) %>% summarize(n_unique = n_distinct(B)). Thanks!
Upvotes: 5
Views: 1676
Reputation: 13800
I think you should be able to do combine(groupby(df, :A), :B => length ∘ unique => :n_distinct_B), like this:
julia> using DataFrames
julia> df = DataFrame(a = rand(["a", "b"], 20), b = rand(1:5, 20))
20×2 DataFrame
Row │ a b
│ String Int64
─────┼───────────────
1 │ a 3
2 │ b 4
3 │ a 1
4 │ a 1
5 │ b 1
6 │ a 2
7 │ b 4
8 │ a 2
9 │ b 2
10 │ b 1
11 │ b 3
12 │ b 3
13 │ a 4
14 │ a 4
15 │ b 3
16 │ b 2
17 │ a 5
18 │ a 5
19 │ b 5
20 │ a 1
julia> combine(groupby(df, :a), :b => length ∘ unique => :n_distinct_b)
2×2 DataFrame
Row │ a n_distinct_b
│ String Int64
─────┼──────────────────────
1 │ a 5
2 │ b 5
Upvotes: 6
Reputation: 69939
Here is an alternative solution using a double groupby-combine strategy that is slightly faster:
julia> df = repeat(DataFrame(a = rand(["a", "b"], 20), b = rand(1:5, 20)), 10^6);
julia> using BenchmarkTools
julia> @btime combine(groupby($df, :a), :b => length ∘ unique => :n_distinct_b);
650.915 ms (237 allocations: 866.37 MiB)
julia> @btime combine(groupby(combine(groupby($df, [:a, :b]), nrow), :a), nrow => :n_distinct_b);
457.884 ms (346 allocations: 561.20 MiB)
(but what Nils proposed is more natural)
Upvotes: 2