imantha

Reputation: 3828

Julia DataFrames.jl, Groupby and summing multiple columns

I am wondering how to use the "by" function to group by a column and sum multiple columns. To group by one column and sum a single column, I can do this:

someData = DataFrame(:Countries => ["Afganistan","Albainia","Albainia","Andorra","Angola","Angola"],:population => rand(100:1000,6), :GDP => rand(1:100,6))

by(someData, :Countries, df ->DataFrame(pop_sum = sum(df[:population])))

However, I want the sums of both population and GDP. I tried something like the following, which is of course incorrect. Any ideas?

by(someData, :Countries, df ->DataFrame(pop_sum, GDP_sum = sum(df[[:population,:GDP]])))

Upvotes: 3

Views: 2284

Answers (1)

Bogumił Kamiński

Reputation: 69869

Do not use the "by" function, as it is deprecated. You probably do not see the deprecation warning because you are starting Julia with --depwarn set to no, which is the default. Use combine together with groupby instead:

julia> someData = DataFrame(:Countries => ["Afganistan","Albainia","Albainia","Andorra","Angola","Angola"],
                            :population => rand(100:1000,6),
                            :GDP => rand(1:100,6))
6×3 DataFrame
│ Row │ Countries  │ population │ GDP   │
│     │ String     │ Int64      │ Int64 │
├─────┼────────────┼────────────┼───────┤
│ 1   │ Afganistan │ 543        │ 29    │
│ 2   │ Albainia   │ 853        │ 71    │
│ 3   │ Albainia   │ 438        │ 81    │
│ 4   │ Andorra    │ 860        │ 88    │
│ 5   │ Angola     │ 940        │ 64    │
│ 6   │ Angola     │ 688        │ 40    │

julia> combine(groupby(someData, :Countries), [:population, :GDP] .=> sum)
4×3 DataFrame
│ Row │ Countries  │ population_sum │ GDP_sum │
│     │ String     │ Int64          │ Int64   │
├─────┼────────────┼────────────────┼─────────┤
│ 1   │ Afganistan │ 543            │ 29      │
│ 2   │ Albainia   │ 1291           │ 152     │
│ 3   │ Andorra    │ 860            │ 88      │
│ 4   │ Angola     │ 1628           │ 104     │

The alternative way to write it would be:

julia> combine(groupby(someData, :Countries)) do sdf
       return (population_sum = sum(sdf.population), GDP_sum=sum(sdf.GDP))
       end
4×3 DataFrame
│ Row │ Countries  │ population_sum │ GDP_sum │
│     │ String     │ Int64          │ Int64   │
├─────┼────────────┼────────────────┼─────────┤
│ 1   │ Afganistan │ 543            │ 29      │
│ 2   │ Albainia   │ 1291           │ 152     │
│ 3   │ Andorra    │ 860            │ 88      │
│ 4   │ Angola     │ 1628           │ 104     │

but it is more verbose in this case. It would be useful if you wanted to do more complex preprocessing of the data before returning the value.
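The question asked for specific output names (pop_sum, GDP_sum) rather than the auto-generated population_sum. A minimal sketch of one way to get them, assuming a recent DataFrames.jl version in which combine accepts the source => function => target form; the table below is small, fixed, illustrative data rather than the random data used above:

```julia
using DataFrames

# Illustrative fixed data (an assumption, chosen so the sums are easy to check by hand)
someData = DataFrame(Countries = ["Albania", "Albania", "Angola"],
                     population = [853, 438, 940],
                     GDP = [71, 81, 64])

# Each argument reads: source column => function => name of the output column
result = combine(groupby(someData, :Countries),
                 :population => sum => :pop_sum,
                 :GDP => sum => :GDP_sum)
```

The result has one row per country, with the columns Countries, pop_sum and GDP_sum. If the default names are acceptable, the shorter [:population, :GDP] .=> sum form is enough.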

Upvotes: 7
