merchmallow

Reputation: 794

Julia Groupby with mean calculation

I have this dataframe:

using DataFrames

d=DataFrame(class=["A","A","A","B","C","D","D","D"],
            num=[10,20,30,40,20,20,13,12],
            last=[3,5,7,9,11,13,100,12])

and I want to do a groupby. In Python I would do:

d.groupby('class')[['num','last']].mean()

How can I do the same in Julia?

I have been trying to use combine and groupby, but with no success so far.

Update: I managed to do it this way:

using Statistics  # provides mean

gd = groupby(d, :class)
combine(gd, :num => mean, :last => mean)

Is there any better way to do it?

Upvotes: 1

Views: 1831

Answers (1)

Bogumił Kamiński

Reputation: 69949

It depends on what you mean by "a better way". You can apply the same function to multiple columns like this:

combine(gd, [:num, :last] .=> mean)

or, if you had a lot of columns and e.g. wanted to apply mean to all columns except the grouping column, you could do:

combine(gd, Not(:class) .=> mean)

or (if you want to avoid having to remember which column was grouping)

combine(gd, valuecols(gd) .=> mean)

These are the basic patterns. The other issue is how to name your target columns. By default they get a name of the form "source_function", like this:

julia> combine(gd, [:num, :last] .=> mean)
4×3 DataFrame
 Row │ class   num_mean  last_mean
     │ String  Float64   Float64
─────┼─────────────────────────────
   1 │ A           20.0     5.0
   2 │ B           40.0     9.0
   3 │ C           20.0    11.0
   4 │ D           15.0    41.6667

You can keep the original column names like this (this is sometimes preferred):

julia> combine(gd, [:num, :last] .=> mean, renamecols=false)
4×3 DataFrame
 Row │ class   num      last
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ A          20.0   5.0
   2 │ B          40.0   9.0
   3 │ C          20.0  11.0
   4 │ D          15.0  41.6667

or like this:

julia> combine(gd, [:num, :last] .=> mean .=> identity)
4×3 DataFrame
 Row │ class   num      last
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ A          20.0   5.0
   2 │ B          40.0   9.0
   3 │ C          20.0  11.0
   4 │ D          15.0  41.6667

The last example shows that the last part can be any function that takes the source column name as a string and returns the target column name, so you can do:

julia> combine(gd, [:num, :last] .=> mean .=> col -> "prefix_" * uppercase(col) * "_suffix")
4×3 DataFrame
 Row │ class   prefix_NUM_suffix  prefix_LAST_suffix
     │ String  Float64            Float64
─────┼───────────────────────────────────────────────
   1 │ A                    20.0              5.0
   2 │ B                    40.0              9.0
   3 │ C                    20.0             11.0
   4 │ D                    15.0             41.6667
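If you would rather spell out each target name than compute it from the source name, a pair of the form source => function => target should also work. This is a minimal sketch, assuming DataFrames.jl 1.x and the d from the question; the names avg_num and avg_last are only illustrative:

```julia
using DataFrames, Statistics

# same data as in the question
d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])
gd = groupby(d, :class)

# each argument is source column => function => target column name
res = combine(gd, :num => mean => :avg_num, :last => mean => :avg_last)
```

This is more verbose than broadcasting with .=>, so it mainly pays off when the target names do not follow a common pattern.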

Edit

Doing the operation in a single line:

You can do just:

combine(groupby(d, :class), [:num, :last] .=> mean)

The benefit of storing groupby(d, :class) in a variable is that the grouping is performed once and the resulting object can then be reused many times, which speeds things up.
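As a rough illustration of that reuse, here is a sketch that assumes the d from the question; the variable names means, maxima and counts are made up for this example:

```julia
using DataFrames, Statistics

# same data as in the question
d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])

gd = groupby(d, :class)  # the grouping is computed once here

# the same grouped object feeds several independent aggregations
means  = combine(gd, valuecols(gd) .=> mean)
maxima = combine(gd, valuecols(gd) .=> maximum)
counts = combine(gd, nrow)  # number of rows per group, in a column called nrow
```

How much time this saves depends on the size of the data; for a frame as small as this one the difference is negligible.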

Also, if you use DataFramesMeta.jl you could write e.g.:

@chain d begin
    groupby(:class)
    combine([:num, :last] .=> mean)
end

which is more typing, but this is a style that people coming from R tend to like.

Upvotes: 8
