Antonello
Antonello

Reputation: 6431

How to produce grouped summary statistics without explicitly naming the variables

Given a Julia dataframe with many variables and a final class column:

julia> df
5×3 DataFrame
 Row │ v1       v2     cl    
     │ Int64?   Int64  Int64 
─────┼───────────────────────
   1 │      10      1      2
   2 │      20      2      2
   3 │     300     10      1
   4 │     400     20      1
   5 │ missing     30      1

I want to obtain a grouped df with summary statistics by class, with the classes by column and variables by rows, like:

julia> dfByCl
11×3 DataFrame
 Row │ var      cl_1       cl_2      
     │ String   Float64?   Float64?  
─────┼───────────────────────────────
   1 │ nrow       3.0       2.0
   2 │ v1_mean  350.0      15.0
   3 │ v1_std    70.7107    7.07107
   4 │ v1_lb    252.002     5.20018
   5 │ v1_ub    447.998    24.7998
   6 │ v1_nm      2.0       2.0
   7 │ v2_mean   20.0       1.5
   8 │ v2_std    10.0       0.707107
   9 │ v2_lb      8.68414   0.520018
  10 │ v2_ub     31.3159    2.47998
  11 │ v2_nm      3.0       2.0

and I don't want to explicitly name all the variables. Is there something simpler/more elegant than the above code ?

using Statistics, DataFrames, Distributions

meansk(data) = mean(skipmissing(data))
stdsk(data)  = std(skipmissing(data))
nm(data)     = sum(.! ismissing.(data))
ci(data::AbstractVector,α=0.05) = meansk(data) - quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data)), meansk(data) + quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data))
cilb(data) = ci(data)[1]
ciub(data) = ci(data)[2]

df = DataFrame(v1=[10,20,300,400,missing],v2=[1,2,10,20,30],cl=[2,2,1,1,1])

dfByCl_w = combine(groupby(df,["cl"]),
   nrow,
   names(df) .=> meansk .=> col -> col * "_mean",
   names(df) .=> stdsk  .=> col -> col * "_std",
   names(df) .=> cilb   .=> col -> col * "_lb",
   names(df) .=> ciub   .=> col -> col * "_ub",
   names(df) .=> nm     .=> col -> col * "_nm",
)
orderedNames = vcat("cl","nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_w     = dfByCl_w[:, orderedNames]
toStack      = vcat("nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_l     = stack(dfByCl_w,toStack)
dfByCl       = unstack(dfByCl_l,"cl","value")
rename!(dfByCl,vcat("var",["cl_$(c)" for c in unique(dfByCl_w.cl)]))

Upvotes: 2

Views: 93

Answers (1)

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69819

Here is what I would normally do in such a case:

julia> cilb(data,α=0.05) = mean(data) - quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
cilb (generic function with 2 methods)

julia> ciub(data,α=0.05) = mean(data) + quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
ciub (generic function with 2 methods)

julia> combine(groupby(df, :cl),
               nrow,
               sdf -> describe(sdf, :mean, :std, cilb => :lb, ciub => :ub, :nmissing, cols=r"v"))
4×8 DataFrame
 Row │ cl     nrow   variable  mean     std        lb          ub         nmissing
     │ Int64  Int64  Symbol    Float64  Float64    Float64     Float64    Int64
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │     1      3  v1          350.0  70.7107    252.002     447.998           1
   2 │     1      3  v2           20.0  10.0         8.68414    31.3159          0
   3 │     2      2  v1           15.0   7.07107     5.20018    24.7998          0
   4 │     2      2  v2            1.5   0.707107    0.520018    2.47998         0

Later you can reshape it as you like, but maybe this layout is something you would actually want?

Upvotes: 2

Related Questions