Reputation: 3828
I am wondering how to use the "by" function to group by and sum multiple columns. If I wanted to groupby one column I can do it by this
someData = DataFrame(:Countries => ["Afganistan","Albainia","Albainia","Andorra","Angola","Angola"],:population => rand(100:1000,6), :GDP => rand(1:100,6))
by(someData, :Countries, df ->DataFrame(pop_sum = sum(df[:population])))
However, I wanted to get the sum of both population and GDP. I tried something like below which is of course incorrect. Any ideas?
by(someData, :Countries, df ->DataFrame(pop_sum, GDP_sum = sum(df[[:population,:GDP]])))
Upvotes: 3
Views: 2284
Reputation: 69869
Do not use by
function as it is deprecated. Instead use this (you do not see the warning as probably you are starting Julia with --depwarn
set to no
which is the default):
julia> someData = DataFrame(:Countries => ["Afganistan","Albainia","Albainia","Andorra","Angola","Angola"],
:population => rand(100:1000,6),
:GDP => rand(1:100,6))
6×3 DataFrame
│ Row │ Countries │ population │ GDP │
│ │ String │ Int64 │ Int64 │
├─────┼────────────┼────────────┼───────┤
│ 1 │ Afganistan │ 543 │ 29 │
│ 2 │ Albainia │ 853 │ 71 │
│ 3 │ Albainia │ 438 │ 81 │
│ 4 │ Andorra │ 860 │ 88 │
│ 5 │ Angola │ 940 │ 64 │
│ 6 │ Angola │ 688 │ 40 │
julia> combine(groupby(someData, :Countries), [:population, :GDP] .=> sum)
4×3 DataFrame
│ Row │ Countries │ population_sum │ GDP_sum │
│ │ String │ Int64 │ Int64 │
├─────┼────────────┼────────────────┼─────────┤
│ 1 │ Afganistan │ 543 │ 29 │
│ 2 │ Albainia │ 1291 │ 152 │
│ 3 │ Andorra │ 860 │ 88 │
│ 4 │ Angola │ 1628 │ 104 │
The alternative way to write it would be:
julia> combine(groupby(someData, :Countries)) do sdf
return (population_sum = sum(sdf.population), GDP_sum=sum(sdf.GDP))
end
4×3 DataFrame
│ Row │ Countries │ population_sum │ GDP_sum │
│ │ String │ Int64 │ Int64 │
├─────┼────────────┼────────────────┼─────────┤
│ 1 │ Afganistan │ 543 │ 29 │
│ 2 │ Albainia │ 1291 │ 152 │
│ 3 │ Andorra │ 860 │ 88 │
│ 4 │ Angola │ 1628 │ 104 │
but it is more verbose in this case (it would be useful if you wanted to do more complex preprocessing of the data before returning the value).
Upvotes: 7