ARM

Reputation: 1550

Better way to count number of occurrences of unique items?

I have a very long DataArray of strings, and I would like to generate a DataFrame in which one column holds all the unique strings and the second holds the number of occurrences of each. Right now I'm doing something like

using DataFrames
df = DataFrame()
df[:B]=[ "a", "c", "c", "D", "E"]
uniqueB = unique(df[:B])
println(uniqueB)
howMany=zeros(size(uniqueB))
for i=1:size(uniqueB,1)
    howMany[i] = count(j->(j==uniqueB[i]), df[:B])
end
answer = DataFrame()
answer[:Letters] = uniqueB
answer[:howMany] = howMany
answer

but it seems like there should be a much easier way to do this, possibly with a single line. (I know I could also make this a bit faster with somewhat more code by searching the result in each iteration rather than the source.) A possibly related question is here but it doesn't look like hist is overloaded for non-numerical bins. Any thoughts?

Upvotes: 10

Views: 4239

Answers (3)

Merlin

Reputation: 1987

DataFrames developer Bogumił Kamiński often uses the following function composition to count the number of unique elements in a column.

combine(df, :B => length ∘ unique)

This returns a DataFrame, so to get the value itself you can index into it:

combine(df, :B => length ∘ unique)[1,1]
    
# or just
length(unique(df.B))

(The function composition symbol ∘ is typed as \circ<tab>.)

Some more alternatives I found while searching for my own answer:

length(keys(groupby(df, :B)))

groupby(df, :B) |> keys |> length

length(union(df[:,:B]))

combine(df, :B => union) |> nrow

maximum(StatsBase.denserank(df.B))

They are all about equally verbose. I would love a countunique function, though I guess it's easy enough to define one.
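A minimal sketch of such a helper, assuming only Base Julia. The name countunique is hypothetical; it is not a function provided by DataFrames.jl or StatsBase.

```julia
# Hypothetical helper (not a library function): number of distinct values
# in any iterable collection, e.g. a DataFrame column such as df.B.
countunique(xs) = length(Set(xs))

letters = ["a", "c", "c", "D", "E"]
n = countunique(letters)
```

Building a Set avoids materialising the intermediate vector that unique returns, but for most column sizes either approach should be fine.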

Upvotes: 0

Fred

Reputation: 697

The by() function was removed from DataFrames.jl. With the current DataFrames v1.5.0, use combine(groupby(...), ...) instead.

julia> df
5×1 DataFrame
 Row │ B
     │ String
─────┼────────
   1 │ a
   2 │ c
   3 │ c
   4 │ D
   5 │ E

julia> combine(groupby(df, :B), nrow)
4×2 DataFrame
 Row │ B       nrow
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ c           2
   3 │ D           1
   4 │ E           1
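If the result should carry the column names from the question, the count column can be named inside combine and the key column renamed afterwards. This is a sketch assuming DataFrames 1.x; :Letters and :howMany are simply the names the question used.

```julia
using DataFrames

df = DataFrame(B = ["a", "c", "c", "D", "E"])

# nrow => :howMany names the count column instead of the default "nrow";
# rename then relabels the grouping column B as Letters.
answer = rename(combine(groupby(df, :B), nrow => :howMany), :B => :Letters)
```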

Upvotes: 2

DSM

Reputation: 353359

If you want a full frame, you can group by B and call nrow on each group:

julia> by(df, :B, nrow)
4x2 DataFrames.DataFrame
| Row | B   | x1 |
|-----|-----|----|
| 1   | "D" | 1  |
| 2   | "E" | 1  |
| 3   | "a" | 1  |
| 4   | "c" | 2  |

Even outside the DataFrame context, though, you can always use DataStructures.counter rather than reimplementing it yourself:

julia> using DataStructures

julia> counter(df[:B])
DataStructures.Accumulator{ASCIIString,Int32}(Dict("D"=>1,"a"=>1,"c"=>2,"E"=>1))
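For comparison, the counting that counter performs can be sketched by hand with a plain Dict in Base Julia. The function name count_occurrences is hypothetical and only illustrates the idea; it is not how DataStructures implements Accumulator.

```julia
# Hypothetical helper: tally how often each item appears, in a single pass.
function count_occurrences(items)
    counts = Dict{eltype(items),Int}()
    for item in items
        # get returns the current tally, or 0 if the item has not been seen yet
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

tally = count_occurrences(["a", "c", "c", "D", "E"])
```

This makes one pass over the data, unlike the loop in the question, which rescans the whole column once per unique value.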

Upvotes: 11
