Reputation: 1550
I have a very long DataArray of strings, and I would like to generate a DataFrame in which one column holds all the unique strings and the second holds the number of occurrences of each. Right now I'm doing something like
using DataFrames
df = DataFrame()
df[:B]=[ "a", "c", "c", "D", "E"]
uniqueB = unique(df[:B])
println(uniqueB)
howMany=zeros(size(uniqueB))
for i=1:size(uniqueB,1)
howMany[i] = count(j->(j==uniqueB[i]), df[:B])
end
answer = DataFrame()
answer[:Letters] = uniqueB
answer[:howMany] = howMany
answer
but it seems like there should be a much easier way to do this, possibly with a single line. (I know I could also make this a bit faster with somewhat more code by searching the result in each iteration rather than the source.) A possibly related question is here but it doesn't look like hist is overloaded for non-numerical bins. Any thoughts?
Upvotes: 10
Views: 4239
Reputation: 1987
DataFrames developer Bogumił Kamiński often uses the following composed function to count the number of unique elements.
combine(df, :B => length ∘ unique)
which returns a one-row DataFrame. To get the value itself, you can index into it:
combine(df, :B => length ∘ unique)[1,1]
# or just
length(unique(df.B))
(The function composition symbol ∘ is typed as \circ<tab>.)
Some more alternates as I search for my own answer:
length(keys(groupby(df, :B)))
groupby(df, :B) |> keys |> length
length(union(df[:,:B]))
combine(df, :B => union) |> nrow
maximum(StatsBase.denserank(df.B))  # denserank, not competerank: competition ranks skip values after ties
They are all about equally verbose. I would love a countunique function, though I guess it's easy enough to define one.
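A minimal sketch of such a helper, using only Base Julia. The name countunique is the hypothetical one wished for above, not an existing function in DataFrames or Base.

```julia
# Hypothetical helper: number of distinct values in any iterable.
# Building a Set avoids materializing the intermediate array that unique() returns.
countunique(xs) = length(Set(xs))

letters = ["a", "c", "c", "D", "E"]
println(countunique(letters))  # prints 4
```

With a DataFrame this would be called as countunique(df.B).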
Upvotes: 0
Reputation: 697
The by() function has been removed from DataFrames.jl. With current DataFrames (v1.5.0), use combine(groupby(...), ...) instead.
julia> df
5×1 DataFrame
 Row │ B
     │ String
─────┼────────
   1 │ a
   2 │ c
   3 │ c
   4 │ D
   5 │ E
julia> combine(groupby(df, :B), nrow)
4×2 DataFrame
 Row │ B       nrow
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ c           2
   3 │ D           1
   4 │ E           1
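To get the column names used in the question, the count column can be named in the same call. This is a sketch assuming DataFrames v1.x; the names :Letters and :howMany come from the question, not from the library.

```julia
using DataFrames

df = DataFrame(B = ["a", "c", "c", "D", "E"])

# nrow => :howMany names the count column instead of the default :nrow
answer = combine(groupby(df, :B), nrow => :howMany)

# Rename the grouping column to match the layout asked for in the question
rename!(answer, :B => :Letters)
```

Unlike the loop in the question, which fills howMany from zeros(...), the counts here stay integers.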
Upvotes: 2
Reputation: 353359
If you want a full frame, you can group by B and call nrow on each group:
julia> by(df, :B, nrow)
4x2 DataFrames.DataFrame
| Row | B | x1 |
|-----|-----|----|
| 1 | "D" | 1 |
| 2 | "E" | 1 |
| 3 | "a" | 1 |
| 4 | "c" | 2 |
Even outside the DataFrame context, though, you can always use DataStructures.counter rather than reimplementing it yourself:
julia> using DataStructures
julia> counter(df[:B])
DataStructures.Accumulator{ASCIIString,Int32}(Dict("D"=>1,"a"=>1,"c"=>2,"E"=>1))
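For comparison, roughly the same tally can be built with Base Julia alone. This is only a sketch of the idea behind a counter, not how DataStructures implements it, and the function name tally is made up for illustration.

```julia
# Hypothetical helper: map each distinct value to its number of occurrences.
function tally(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        # get() supplies 0 the first time a value is seen
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

counts = tally(["a", "c", "c", "D", "E"])
println(counts["c"])  # prints 2
```

This makes a single pass over the data, whereas the loop in the question rescans the whole column once per unique value.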
Upvotes: 11